The idea of pairwise classification has been known in the neural networks and statistics communities for some time. For example, Witten and Frank (2000, p.113) refer to it as a technique for making linear regression applicable to multi-class problems but do not cite a source. The earliest reference we could find is Knerr et al. (1990) who propose a stepwise procedure for linearizing non-linear multi-class problems by first trying to identify classes that can be solved by a one-against-all approach, then a pairwise approach, and finally a piece-wise linear technique. The motivation behind this and other works in the neural networks community is that it is often better to have a modular network, i.e., a network that consists of several simpler and independently trained sub-networks, rather than a single, complex multi-layer neural network, which usually requires a large hidden layer and significant training times. Unordered (Anand et al., 1995) and pairwise (Lu and Ito, 1999) binarization techniques have also been investigated in this context.
Friedman (1996) evaluates pairwise classification on two versions of CART (Breiman et al., 1984) and a nearest neighbor algorithm on 50 randomly generated problems. He observed improvements for the CART version which uses a linear function for splitting a node and for the nearest neighbor rule. For CART with axis parallel splits, the performance of the pairwise class binarization was similar to that of the standard techniques. The author also provided a brief discussion of the complexity of the approach.
Naturally, pairwise classification was also investigated within the support vector machine community. A comparison of unordered and round robin binarization on a speaker identification problem can be found in (Schmidt, 1996; Schmidt and Gish, 1996), without providing a clear conclusion about the preferable approach. Kreßel (1999) discusses the technique in some detail and presents empirical results on a digit recognition task, where the author notes the unexpected efficiency of the approach without providing an explanation for it. Recently, Hsu and Lin (2002) conducted an excellent empirical comparison between unordered binarization, pairwise classification with both the simple voting technique we used and with the more efficient proposal for organizing the classifiers into an efficient DAG (Platt et al., 2000), as well as two approaches for directly generalizing the support vector algorithm to multi-class problems. In their experiments, round robin binarization performed best, both in terms of accuracy and training time. Interestingly, the advantage over competing methods was more pronounced for a linear kernel than for a non-linear RBF kernel. We interpret this as evidence that round robin binarization simplifies the individual binary decision problems as motivated in Figure 1.
Angulo and Català (2000) suggest a related technique where multi-class problems are mapped to 3-class problems. Like with pairwise classification, the idea is to generate one training set for each pair of classes, but in addition to encoding the two class values with target values +1 and -1, examples of all other classes are added with a target value of 0, which gives up some of the advantages that result from the reduction of the training set sizes on the binary problems. They do not evaluate their approach. A similar idea was used by Kalousis and Theoharis (1999) for the meta-learning task of predicting the most suitable learning algorithm(s) for a given dataset. A nearest-neighbor learner was trained to predict the better algorithm for each pair of learning algorithms, where each of these pairwise problems had three classes: one for each algorithm and a third class ``tie'' if both algorithms performed indistinguishably.
Error-correcting output codes (Dietterich and Bakiri, 1995) are a popular and powerful class binarization technique. The basic idea is to encode a c-class problem as binary problems ( ), where each binary problem uses a subset of the classes as the positive class and the remaining classes as a negative class. It may thus be viewed as a generalization of unordered binarization, where only single classes are used as positive examples. As a consequence, each original class is encoded as a -dimensional binary vector, one dimension for each prediction of a binary problem (by convention +1 for positive and -1 for negative). The resulting matrix of the form is called the coding matrix. New examples are classified by determining the row in the matrix that is closest to the binary vector obtained by submitting the example to the classifiers. If the binary problems are chosen in a way that maximizes the distance between the class vectors, the reliability of the classification can be significantly increased. Error-correcting output codes can also be easily parallelized, but each subtask requires the total training set. As typically , the penalty function , i.e., pairwise and unordered binarization are more efficient.
Allwein et al. (2000) show that a generalization of error-correcting output codes can be used as a general framework for class binarization techniques. Their basic idea is to generalize the coding matrices in a way that allows examples to be ignored in the binary problems, i.e., to allow the coding matrices to be of the form . Clearly, pairwise classification is a special case in this framework. For example, the coding matrix for a double round robin has columns, where the column corresponding to the binary problem <i,j> has +1 in the i-th row, -1 in the j-th row and 0 in all other rows. Thus, each row is a vector of the form . Note, however, that the vector of predictions is of the form because every binary classifier will make a prediction (either +1 or -1) for every example. Nevertheless, the simple voting procedure we used is equivalent to finding the row that is most similar to the prediction vector (if similarity is measured with the Hamming distance), which is equivalent to the decoding technique suggested by Dietterich and Bakiri (1995). Allwein et al. (2000) criticize this simple Hamming decoding and suggest the use of loss-based decoding techniques that take into account the confidence of the base learner into its own predictions. In a theoretical analysis, which relates the training error of decoding methods to the error on the binary problems and to the minimum distance between entries in the coding matrix, the authors derive upper bounds for the training error of loss-based decoding that are lower than those for Hamming decoding. In an experimental study with five different class binarization techniques and three decoding techniques (two of them loss-based and one Hamming decoding), the loss-based techniques seemed to produce lower generalization errors. Their results also showed that for support vector machines unordered binarization is inferior to all other techniques, among them pairwise classification. Among these alternatives, no clear winner emerged. However, for boosted decision trees, unordered binarization performed on the same level as all other approaches.
Finally, we note the relation of round robin classification to comparison training (Utgoff and Clouse, 1991; Tesauro, 1989), which has been proposed as a training procedure in evaluation function learning. In this framework, the learner is not trained with the target values of the evaluation function in certain states, but instead is trained on state pairs where the preferable state is marked. Thus, this training procedure is somewhere between supervised learning (where the function is trained on examples of target values) and reinforcement learning (where it only receives indirect feedback about the value of states). Tesauro (1989) demonstrated a particularly interesting technique, where he enforced a symmetric neural network architecture and showed that with this architecture, one only has to perform n network evaluations to determine the best of n moves. It is an interesting open question, whether a similar technique could be employed for speeding up pairwise classification.