Theoretical Considerations

In this section, we will see that although (single) round robin classification turns a single c-class learning problem into c(c-1)/2 two-class problems, the total training effort is only linear in the number of classes and smaller than the effort needed for an unordered binarization. The analysis is independent of the type of base learning algorithm used, although we will show that the advantage increases with the computational complexity of the algorithm. Some of the ideas have already been sketched in a short paragraph by Friedman (1996), but we go into considerably more detail and, in particular, focus on the comparison to conventional class binarization techniques.

In the following, we assume a base learner with a time complexity function f(n), i.e., the time needed for learning a classifier from a training set with n examples is f(n). Note that we interpret this as an exact function, and not as an asymptotically tight bound as in $\Theta(f(n))$ because we are not interested in asymptotic behavior, but in the exact complexity at a given training set size n.⁸ We will consider functions of the form $f(n) = \lambda n^p (p \geq 1, \lambda > 0)$ and denote such a function with

. We use b|f to denote a class binarization with algorithm b, where an base learner with complexity f(n) is applied to each binary problem. Unless mentioned otherwise, all results refer to single round robin binarizations of problems with more than two classes (c > 2).

Definition 5.1 (class penalty) Let $g_{b\vert f}(c,n)$ be the total complexity for using a learner with complexity f(n) on a problem with c > 2 classes that has been class-binarized using binarization algorithm b. We then define the class penalty function $\pi_{b\vert f}$ as

$\displaystyle \pi_{b\vert f}(c,n) = \frac{g_{b\vert f}(c,n)}{f(n)}$

Intuitively, the class penalty $\pi_{b\vert f}(c,n)$ measures the performance of an algorithm on a class binarized c-class problem relative to its performance on a single two-class problem the same size n. In the following, it will turn out that in some cases the class penalty function is independent of n or f. In such cases, we will abbreviate the notation as $\pi_b(c)$ .

Proof. Each example of the original training set will occur exactly once in each of the c-1 binary tasks where its class is paired against one of the other c-1 classes. As there are n examples in the original training set, the total number of examples is (c-1)n.

As f is linear, the sum of the complexities on all individual training sets is equal to the complexity of the algorithm on the sum of all training set sizes, i.e., $\sum f_1(n_i) = f_1(\sum n_i)$ . Hence,

$\displaystyle g_{r\vert f_1}(c,n) = f_1((c-1)n) = \lambda (c-1)n = (c-1)(\lambda n) = (c-1)f_1(n)$

Therefore $\pi_{r\vert f_1}(c) = c-1$ . $\qedsymbol$

This analysis ignores a possible constant overhead of the algorithm, which potentially affects c(c-1)/2 function calls in the round robin case, while it only affects c function calls in the unordered case. However, some significant overhead costs, like reading in the training examples (and similar initialization steps) need, of course, only be performed once if the round robin procedure is performed in memory (which was not the case in the implementation which we used for the experimental results reported in the next section). If there is an overhead $\mu$ to be considered (i.e., $f(n) = \lambda n + \mu$ ), the total costs will be increased by $\mu c(c-1)/2$ . For very large values of c, these quadratic costs may outweigh the savings, but under reasonable assumptions (e.g.,

) these additive costs should not matter, in particular--as we shall see in the following--not in the case of super-linear base algorithms.

Proof. Assume we have c classes, class i has

examples, $\sum_{i=1}^{c} n_i = n$ .

c is even:
In this case, we can arrange the learning tasks in the form of c-1 rounds. Each round consists of c/2 disjoint pairings, i.e., each class occurs exactly once in each round, and it has a different partner in each round. Such a tournament schedule is always possible.⁹ Without loss of generality, consider a round where classes 2i are paired against classes 2i-1. The complexity of this round is $\sum_{i=1}^{c/2} f_p(n_{2i} + n_{2i-1})$ . As for p > 1 and , i = 1...N it holds that $\sum_i a_i^p < (\sum_i a_i)^p$ , and because we assumed c > 2, we get

$\displaystyle \sum_{i=1}^{\frac{c}{2}} f_p(n_{2i} + n_{2i-1}) < f_p(\sum_{i=1}^{\frac{c}{2}} n_{2i} + n_{2i-1}) = f_p(\sum_{i=1}^c n_i) = f_p(n)$

Analogously, we can derive the same upper bound for each of the c-1 rounds. Thus the total complexity of the round robin binarization $g_{r\vert f_p}(c,n) < (c-1) f_p(n)$ and $\pi_{r\vert f_p}(c,n) < c-1$ .

c is odd:
we add a dummy class with $n_{c+1} = 0$ examples, and perform a tournament as above. As this tournament has c rounds, $\pi_{r\vert f_p}(c,n) < c$ .

$\qedsymbol$

As mentioned above, the bounds used for proving the lemmas are certainly not tight. This becomes obvious, if we look at problems with a uniform class distribution.

Proof. In the (single) round robin case, we have c(c-1)/2 problems with 2n/c examples each. Hence the total complexity is

$\displaystyle g_{r\vert f_p}(c,n)$	$\displaystyle =$	$\displaystyle \frac{c(c-1)}{2} f_p\left(2\frac{n}{c}\right) = (c-1)\frac{c}{2} ... ...a\left(2\frac{n}{c}\right)^p = (c-1)\left(\frac{2}{c}\right)^{p-1}\lambda n^p =$
	$\displaystyle =$	$\displaystyle (c-1)\left(\frac{2}{c}\right)^{p-1} f_p(n)$

Therefore $\pi_{r\vert f_p}(c) = (c-1)(\frac{2}{c})^{p-1}$ . $\qedsymbol$

From this result, it is easy to see that $\pi_{r\vert f_p}(c,n)$ decreases with increasing complexity order p of the base learner (assuming c > 2). Likewise, for p > 2, $\pi_{r\vert f_p}(c,n)$ can be made arbitrarily small by increasing the number of classes c. While the latter property is hard to generalize for arbitrary class distributions--it is not the case that every problem with more than c classes has a smaller class penalty than an arbitrary c-class problem (and vice versa)--it is easy to prove the following theorem:

Theorem 5.7 (class penalty ratio for increasing order of base algorithm)
For c > 2, the class penalty ratio $\frac{\pi_{r\vert f_p}(c,n)}{\pi_{u\vert f_p}(c,n)}$ is monotonically decreasing with p and

$\displaystyle \lim_{p\rightarrow\infty}\frac{\pi_{r\vert f_p}(c,n)}{\pi_{u\vert f_p}(c,n)} = 0$

Proof. The total effort for the single round robin is $\sum_{i=1}^{c-1} \sum_{j=i+1}^c f_p(n_i + n_j)$ . Hence the class penalty ratio is:

$\displaystyle \frac{\pi_{r\vert f_p}(c,n)}{\pi_{u\vert f_p}(c,n)} = \frac{g_{r... ... = \frac{1}{c}\sum_{i=1}^{c-1} \sum_{j=i+1}^c \frac{ f_p(n_i + n_j)}{f_p(n)}$

Consequently,

$\displaystyle \lim_{p\rightarrow\infty}\frac{\pi_{r\vert f_p}(c,n)}{\pi_{u\vert... ...um_{j=i+1}^c \lim_{p\rightarrow\infty} \left( \frac{n_i + n_j}{n} \right)^p = 0$

The last equation holds because c > 2 and $0 < n_i + n_j < n (i,j = 1\dots c \, ; \; i \neq j)$ , hence $\frac{n_i + n_j}{n} < 1$ . For the same reasons $\left( \frac{n_i + n_j}{n} \right)^p > \left( \frac{n_i + n_j}{n} \right)^{p+\epsilon}$ for all $\epsilon > 0$ and $\frac{\pi_{r\vert f_p}(c,n)}{\pi_{u\vert f_p}(c,n)}$ is monotonically decreasing with p. $\qedsymbol$

As the decrease with increasing order p is strictly monotonic, this theorem implies that the more expensive a learning algorithm is, the larger will be the efficiency gain for using round robin binarization instead of unordered binarization. In particular, it may be the case that for expensive algorithms, even a double round robin is faster than unordered binarization (in fact, it is easy to see that Theorem 5.7 also holds for double round robin binarization) or may be even faster than ordered binarization. Assume, e.g., an algorithm with a quadratic complexity on a class-balanced eight-class problem (i.e., p=2 and c = 8). According to Theorem 5.6 and Lemma 5.2: