5. Numerical Implementation and Comparisons

The implementation of LSVM is straightforward, as shown in the previous sections. Our first ``dry run" experiments were on randomly generated problems just to test the speed and effectiveness of LSVM on large problems. We first used a Pentium III 500 MHz notebook with 384 megabytes of memory (and additional swap space) on 2 million randomly generated points in with and . LSVM solved the problem in 6 iterations in 81.52 minutes to an optimality criterion of on a 2-norm violation of (14). The same problem was solved in the same number of iterations and to the same accuracy in minutes on a 250 MHz UltraSPARC II processor with 2 gigabytes of memory. After these preliminary encouraging tests we proceeded to more systematic numerical tests as follows.

Most of the rest of our experiments were run on the Carleton College workstation ``gray'', which utilizes a 700 MHz Pentium III Xeon processor and a maximum of 2 Gigabytes of memory available for each process. This computer runs Redhat Linux 6.2, with MATLAB 6.0. The one set of experiments showing the checkerboard were run on the UW-Madison Data Mining Institute ``locop2'' machine, which utilizes a 400 MHz Pentium II Xeon and a maximum of 2 Gigabytes of memory available for each process. This computer runs Windows NT Server 4.0, with MATLAB 5.3.1. Both gray and locop2 are multiprocessor machines. However, only one processor was used for all the experiments shown here as MATLAB is a single threaded application and does not distribute any effort across processors [27]. We had exclusive access to these machines, so there were no issues with other users inconsistently slowing the computers down.

The first results we present are designed to show that our reformulation of the SVM (7) and its associated algorithm LSVM yield similar performance to the standard SVM (2), referred to here as SVM-QP. Results are also shown for our active set SVM (ASVM) algorithm [18]. For six datasets available from the UCI Machine Learning Repository [22], we performed tenfold cross validation in order to compare test set accuracies between the methodologies. Furthermore, we utilized a tuning set for each algorithm to find the optimal value of the parameter . For both LSVM and ASVM, we used an optimality tolerance of 0.001 to determine when to terminate. SVM-QP was implemented using the high-performing CPLEX barrier quadratic programming solver [9] with its default stopping criterion. Altering the CPLEX default stopping criterion to match that of LSVM did not result in significant change in timing relative to LSVM, but did reduce test set correctness for SVM-QP.

We also tested SVM-QP with the well-known optimized algorithm SVM [10]. Joachims, the author of SVM, graciously provided us with the newest version of the software (Version 3.10b) and advice on setting the parameters. All features for all experiments were normalized to the range as recommended in the SVM documentation. Such normalization was used in all experiments that we present, and is recommended for the LSVM algorithm due to the penalty on the bias parameter which we have added to our objective (7). We chose to use the default termination error criterion in SVM of . We also present an estimate of how much memory each algorithm used, as well as the number of support vectors. The number of support vectors was determined here as the number of components of the dual vector that were nonzero. Moreover, we also present the number of components of larger than 0.001 (referred to in the table as ``SVs with tolerance'').

The results demonstrate that LSVM performs comparably to SVM-QP with respect to generalizability, and shows running times comparable to or better than SVM. LSVM and ASVM show nearly identical generalization performance, as they solve the same optimization problem. Any differences in generalization observed between the two are only due to different approximate solutions being found at termination. LSVM and ASVM usually run in roughly the same amount of time, though there are some cases where ASVM is noticeably faster than LSVM. Nevertheless, this is impressive performance on the part of LSVM, which is a dramatically simpler algorithm than ASVM, CPLEX, and SVM. LSVM utilizes significantly less memory than the SVM-QP algorithms as well, though it does require more support vectors than SVM-QP. We address this point in the conclusion.

We next compare LSVM with SVM on the Adult dataset [22], which is commonly used to compare SVM algorithms [24,17]. These results, shown in Table 2, demonstrate that for the largest training sets LSVM performs faster than SVM with similar test set accuracies. Note that the ``iterations'' column takes on very different meanings for the two algorithms. SVM defines an iteration as solving an optimization problem over a a small number, or ``chunk,'' of constraints. LSVM, on the other hand, defines an iteration as a matrix calculation which updates all the dual variables simultaneously. These two numbers are not directly comparable, and are included here only for purposes of monitoring scalability. In this experiment, LSVM uses significantly more memory than SVM.

Table 3 shows results from running LSVM on a massively sized dataset. This dataset was created using our own NDC Data Generator [23] as suggested by Usama Fayyad. The results show that LSVM can be used to solve massive problems quickly, which is particularly intriguing given the simplicity of the LSVM Algorithm. Note that for these experiments, all the data was brought into memory. As such, the running time reported consists of the time used to actually solve the problem to termination excluding I/O time. This is consistent with the measurement techniques used by other popular approaches [11,24]. Putting all the data in memory is simpler to code and results in faster running times. However, it is not a fundamental requirement of our algorithm -- block matrix multiplications, incremental evaluations of using another application of the Sherman-Morrison-Woodbury identity, and indices on the dataset can be used to create an efficient disk based version of LSVM. We note that we tried to run SVM on this dataset, but after running for more than six hours it did not terminate.

We now demonstrate the effectiveness of LSVM in solving nonlinear
classification problems through the use of kernel functions. One highly
nonlinearly separable but simple example is the ``tried and true''
checkerboard dataset [8], which has often been used to show
the effectiveness of nonlinear kernel methods on a dataset for which a
linear separation clearly fails. The checkerboard dataset contains
1000 points randomly sampled from a checkerboard. These points are
used as a training set in LSVM to try to reproduce an accurate
rendering of a checkerboard. For this problem, we used the following
Gaussian kernel:

The final set of experiments, shown in Table 4, demonstrates the effectiveness of nonlinear kernel functions with LSVM on UCI datasets [22] specialized as follows for purposes of our tests:

- The
*liver-disorders*dataset contains 345 points, each consisting of six features. Class 1 contains 145 points, and class -1 contains 200 points. - The
*letter-recognition*dataset is used for recognizing letters of the alphabet. Traditionally, this dataset is used for multi-category classification. We used it here in a two class situation by taking a subset of those points which correspond to the letter ``A,'' and a subset of those points which correspond to the letter ``B.'' This resulted in a dataset of 600 points with 6 features, where each class contained 300 points. - The
*mushroom*dataset is a two class dataset which contains eight categorical attributes. We transformed each categorical attribute into a series of binary attributes, one attribute for each distinct value. For example, this dataset contains an attribute called ``cap surface,'' which can take one one of four categories, namely ``fibrous,'' ``grooves,'' ``scaly,'' or ``smooth.'' We represented this as four binary attributes. A 1 is assigned to the attribute that corresponds to the actual category, and a 0 to the rest. Thus the categorical value ``fibrous'' would be represented by , while the value ``smooth'' would be represented by . A subset of the entire mushroom dataset was used, so as to shorten the running times of our experiments. The final dataset we used contained 22 features with 200 points in class 1 and 300 points in class -1. - The
*tic-tac-toe*dataset is a two class dataset that contains incomplete tic-tac-toe games. All those games with a possible winning move for ``X'' end up in category 1, and all other games end up in category -1. We have intentionally represented this problem with a poor representation scheme to show the power of non-linear kernels in overcoming this difficulty. Each tic-tac-toe game is represented by 9 attributes, where each attribute corresponds to a spot on the tic-tac-toe board. An ``X'' is represented by 1, an ``O'' is represented by 0, and a blank is represented by -1. We used a subset of this dataset with 9 attributes, 200 points in class 1, and 100 points in class -1.

We chose which kernels to use and appropriate values of via tuning sets and tenfold cross-validation techniques. The results show that nonlinear kernels produce test set accuracies that improve on those obtained with linear kernels.