Next: 5. Numerical Implementation and Up: Lagrangian Support Vector Machines Previous: 3. LSVM (Lagrangian Support

4. LSVM for Nonlinear Kernels

In this section of the paper we show how LSVM can be used to solve classification problems with positive semidefinite nonlinear kernels. Algorithm 1 and its convergence result can be extended to such nonlinear kernels, as we show below. The only price paid for this extension is that large datasets can be handled via the Sherman-Morrison-Woodbury (SMW) identity (1) only if the inner product terms of the kernel [16, Equation (3)] are explicitly known, which in general they are not. Nevertheless, LSVM may be a useful tool for classification with nonlinear kernels because of its extreme simplicity, as we demonstrate below with simple MATLAB code that uses neither the Sherman-Morrison-Woodbury identity nor any optimization package.

We shall use the notation of [16]. For $A \in R^{m \times n}$ and $B\in R^{n\times \ell}$, the kernel $K(A,B)$ maps $R^{m\times n} \times R^{n\times \ell}$ into $R^{m\times \ell}$. A typical kernel is the Gaussian kernel $\varepsilon^{-\mu\Vert A_i'-B_{\cdot j}\Vert^2}$, $i,j=1,\ldots,m$, $\ell=m$, where $\varepsilon$ is the base of natural logarithms, while a linear kernel is $K(A,B)=AB$. For a column vector $x$ in $R^n$, $K(x',A')$ is a row vector in $R^m$, and the linear separating surface (4) is replaced by the nonlinear surface

\begin{displaymath}
K([x'\;\;-1],\left[\begin{array}{c}
A'\\ -e'
\end{array}\right])Du=0,
\end{displaymath} (26)

where $u$ is the solution of the dual problem (11) with $Q$ re-defined for a general nonlinear kernel as follows:
\begin{displaymath}
Q=\frac{I}{\nu}+DK(G,G')D, \qquad G=[A\;\;-e].
\end{displaymath} (27)

Note that the nonlinear separating surface (26) degenerates to the linear one (4) if we let $K(G,G')=GG'$ and make use of (9).
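To make the kernel notation concrete, the Gaussian kernel matrix $K(A,B)$ can be formed with a few lines of NumPy. This sketch is ours, not the paper's; the function name `gaussian_kernel` and the parameter name `mu` are our choices:

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Gaussian kernel K(A, B) in R^{m x l} for A in R^{m x n},
    B in R^{n x l}:  K_ij = exp(-mu * ||A_i' - B_{.j}||^2)."""
    # Squared Euclidean distances between rows of A and columns of B,
    # clipped at zero to guard against small negative rounding errors.
    d2 = (np.sum(A * A, axis=1)[:, None]
          + np.sum(B * B, axis=0)[None, :]
          - 2.0 * (A @ B))
    return np.exp(-mu * np.maximum(d2, 0.0))
```

With $B=A'$ this yields the symmetric $m\times m$ kernel matrix with unit diagonal used below as $K(G,G')$; a linear kernel is simply `A @ B`.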

To justify the nonlinear kernel formulation (27) we refer the reader to [16, Equation (8.9)] for a similar result and give a brief derivation here. If we rewrite our dual problem for a linear kernel (8) in the equivalent form:

\begin{displaymath}
\min_{0\le u\in R^m}\;\frac{1}{2}u'\left(\frac{I}{\nu}+DGG'D\right)u-e'u,
\end{displaymath} (28)

and replace the linear kernel $GG'$ by a general nonlinear positive semidefinite symmetric kernel $K(G,G')$ we obtain:
\begin{displaymath}
\min_{0\le u\in R^m}\;\frac{1}{2}u'\left(\frac{I}{\nu}+DK(G,G')D\right)u-e'u.
\end{displaymath} (29)

This is the formulation given above in (27). We note that the Karush-Kuhn-Tucker necessary and sufficient optimality conditions for this problem are:
\begin{displaymath}
0\le u\;\;\perp\;\;\left(\frac{I}{\nu}+DK([A\;\;\;-e],\left[\begin{array}{c}
A'\\ -e'
\end{array}\right])D\right)u-e\ge 0,
\end{displaymath} (30)

which underlie LSVM for a nonlinear positive semidefinite kernel $K(G,G')$. The positive semidefiniteness of the nonlinear kernel $K(G,G')$ is needed in order to ensure the existence of a solution to both (29) and (30).
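The complementarity condition (30) also gives a simple numerical optimality check for any candidate $u$. The following NumPy sketch (ours, not part of the paper) measures the violation of $0\le u \perp Qu-e\ge 0$:

```python
import numpy as np

def kkt_residual(u, Q, e=None):
    """Residual of the complementarity condition
    0 <= u  perp  Qu - e >= 0.
    Returns the largest of: violation of u >= 0, violation of
    Qu - e >= 0, and the complementarity gap |u'(Qu - e)|.
    The residual is zero exactly at a solution."""
    if e is None:
        e = np.ones_like(u)
    r = Q @ u - e
    return max(np.max(-u, initial=0.0),   # u >= 0 violated?
               np.max(-r, initial=0.0),   # Qu - e >= 0 violated?
               abs(u @ r))                # complementarity gap
```

For instance, with $Q=I$ (i.e., $\nu=1$ and a zero kernel) the solution of (30) is $u=e$, and the residual vanishes there.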

All the results of the previous section remain valid, with $Q$ re-defined as above for any positive semidefinite kernel $K$. This includes the iterative scheme (15) and the convergence result given after Algorithm 1. However, because we do not make use of the Sherman-Morrison-Woodbury identity for a nonlinear kernel, the LSVM MATLAB Code 3 is somewhat different, as follows:

Code 3   LSVM MATLAB M-File for Nonlinear Kernel $K(\cdot,\cdot)$

function [it, opt, u] = svmlk(nu,itmax,tol,D,KM)
% lsvm with nonlinear kernel for min 1/2*u'*Q*u-e'*u s.t. u=>0
% Q=I/nu+DK(G,G')D, G=[A -e]
% Input: nu, itmax, tol, D, KM=K(G,G')
% [it, opt, u] = svmlk(nu,itmax,tol,D,KM);
m=size(KM,1);alpha=1.9/nu;e=ones(m,1);I=speye(m);
Q=I/nu+D*KM*D;P=inv(Q);
u=P*e;oldu=u+1;it=0;
while it<itmax & norm(oldu-u)>tol
   oldu=u;
   u=P*(1+pl(Q*u-1-alpha*u));
   it=it+1;
end;
opt=norm(u-oldu);[it opt]

function pl = pl(x); pl = (abs(x)+x)/2;
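For readers who prefer not to run MATLAB, the iteration of Code 3 translates almost line for line into NumPy. This is an illustrative sketch of ours, not the authors' code; it applies the update $u \leftarrow Q^{-1}(e+((Qu-e)-\alpha u)_+)$ with $\alpha = 1.9/\nu$:

```python
import numpy as np

def pl(x):
    """Plus function (x)_+ = max(x, 0), applied componentwise."""
    return (np.abs(x) + x) / 2

def svmlk(nu, itmax, tol, D, KM):
    """LSVM with a nonlinear kernel:
    min 1/2 u'Qu - e'u  s.t. u >= 0,  Q = I/nu + D K(G,G') D,
    where KM = K(G,G') is the precomputed kernel matrix."""
    m = KM.shape[0]
    alpha = 1.9 / nu
    e = np.ones(m)
    Q = np.eye(m) / nu + D @ KM @ D
    P = np.linalg.inv(Q)
    u = P @ e
    oldu = u + 1
    it = 0
    while it < itmax and np.linalg.norm(oldu - u) > tol:
        oldu = u
        u = P @ (1 + pl(Q @ u - 1 - alpha * u))
        it += 1
    return it, np.linalg.norm(u - oldu), u
```

Here `KM` is $K(G,G')$ with $G=[A\;\;-e]$, and `D` is the diagonal matrix of $\pm 1$ labels.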

Since we cannot use the Sherman-Morrison-Woodbury identity in the general nonlinear case, this code is effective only for problems of moderate size: the matrices $P$ and $Q$ of Code 3 are $m\times m$, so their storage scales quadratically with the number of data points.

Journal of Machine Learning Research