Theoretical Considerations

In this section, we will see that although (single) round robin
classification turns a single *c*-class learning problem into
*c(c-1)/2* two-class problems, the total training effort is only
linear in the number of classes and smaller than the effort needed for
an unordered binarization. The analysis is independent of the type of
base learning algorithm used, although we will show that the advantage
increases with the computational complexity of the algorithm. Some of
the ideas have already been sketched in a short paragraph by
Friedman (1996), but we go into considerably
more detail and, in particular, focus on the comparison to
conventional class binarization techniques.

In the following, we assume a base learner with a time complexity
function *f(n)*, i.e., the time needed for learning a classifier from
a training set with *n* examples is *f(n)*. Note that we
interpret this as an *exact* function, and not as an
asymptotically tight bound as in
because we are not
interested in asymptotic behavior, but in the exact complexity at a
given training set size *n*.^{8} We will consider
functions of the form
and denote such a function with . We use *b|f* to denote a class
binarization with algorithm *b*, where an base learner with complexity
*f(n)* is applied to each binary problem. Unless mentioned otherwise,
all results refer to single round robin binarizations of problems with
more than two classes (*c > 2*).

Intuitively, the class penalty
measures the
performance of an algorithm on a class binarized *c*-class problem
relative to its performance on a single two-class problem the same
size *n*. In the following, it will turn out that in some cases the
class penalty function is independent of *n* or *f*. In such
cases, we will abbreviate the notation as .

For a base learner with a linear run-time complexity : .

As *f* is linear, the sum of the complexities on all individual
training sets is equal to the complexity of the algorithm on the sum
of all training set sizes, i.e.,
. Hence,

This analysis ignores a possible constant overhead of the
algorithm, which potentially affects *c(c-1)/2* function calls in
the round robin case, while it only affects *c* function calls in the
unordered case. However, some significant overhead costs, like reading
in the training examples (and similar initialization steps) need, of
course, only be performed once if the round robin procedure is
performed in memory (which was not the case in the implementation
which we used for the experimental results reported in the next
section). If there is an overhead to be considered (i.e.,
), the total costs will be increased by
. For very large values of *c*, these
quadratic costs may outweigh the savings, but under reasonable
assumptions (e.g., ) these additive costs should not matter,
in particular--as we shall see in the following--not in the case of
super-linear base algorithms.

*c* is even:

In this case, we can arrange the learning tasks in the form of *c-1*
rounds. Each round consists of *c/2* disjoint pairings, i.e.,
each class occurs exactly once in each round, and it has a different
partner in each round. Such a tournament schedule is always
possible.^{9} Without loss of generality, consider a round where classes
*2i* are paired against classes *2i-1*. The complexity of this round
is
. As for *p > 1* and
, *i = 1...N* it holds that
, and
because we assumed *c > 2*, we get

*c* is odd:

we add a dummy class with
examples, and perform a tournament as above. As this tournament has
*c* rounds,
.

For algorithms with at least linear complexity (): , i.e., single round robin is more efficient than unordered binarization.

As mentioned above, the bounds used for proving the lemmas are certainly not tight. This becomes obvious, if we look at problems with a uniform class distribution.

Therefore .

From this result, it is easy to see that
decreases
with increasing complexity order *p* of the base learner (assuming *c
> 2*). Likewise, for *p > 2*,
can be made arbitrarily small
by increasing the number of classes *c*. While the latter property is
hard to generalize for arbitrary class distributions--it is not the
case that every problem with more than *c* classes has a smaller class
penalty than an arbitrary *c*-class problem (and vice versa)--it is
easy to prove the following theorem:

For

As the decrease with increasing order *p* is strictly monotonic, this
theorem implies that the more expensive a learning algorithm is, the
larger will be the efficiency gain for using round robin binarization
instead of unordered binarization. In particular, it may be the case
that for expensive algorithms, even a *double* round robin is
faster than unordered binarization (in fact, it is easy to see that
Theorem 5.7 also holds for double round robin
binarization) or may be even faster than ordered binarization.
Assume, e.g., an algorithm with a quadratic complexity on a
class-balanced eight-class problem (i.e., *p=2* and *c = 8*).
According to Theorem 5.6 and
Lemma 5.2: