

The SPLICE dataset: Feature selection

Let us examine more closely the single tree classifier that was used for the SPLICE data set. According to the Markov properties of the tree distribution, the probability of the class variable depends only on its neighbors, that is, on the variables to which the class variable is connected by tree edges. Hence, a tree acts as an implicit variable selector for classification: only the variables adjacent to the queried variable (this set of variables is called its Markov blanket; Pearl, 1988) are relevant for determining its probability distribution. This property also explains the observed preservation of the accuracy of the tree classifier when the size of the training set decreases: out of the 60 variables, only 18 are relevant to the class; moreover, the dependence is parametrized as 18 independent pairwise probability tables $T_{class,v}$. Such parameters can be fit accurately from relatively few examples. Hence, as long as the training set contains enough data to establish the correct dependency structure, the classification accuracy degrades slowly as the size of the data set decreases. This also helps to explain the superiority of the tree classifier over the models in DELVE: only a small subset of the variables is relevant for classification, and the tree identifies it correctly. A classifier that cannot perform feature selection reasonably well will be hindered by the remaining irrelevant variables, especially if the training set is small.

For a given Markov blanket, the tree classifies in the same way as a naive Bayes model with the Markov blanket variables as inputs. Note also that the naive Bayes model itself has a built-in feature selector: if one of the input variables $v$ is not relevant to the class, the distributions $P_{v\vert c}$ will be roughly the same for all values of $c$. Consequently, in the posterior $P_{c\vert v}$ that serves for classification, the factors corresponding to $v$ approximately cancel, so $v$ has little influence on the classification. This may explain why the naive Bayes model also performs well on the SPLICE classification task. Notice, however, that the variable selection mechanisms implemented by the tree classifier and the naive Bayes classifier are not the same.

To verify that the single tree classifier indeed acts as a feature selector, we performed the following experiment, again using the SPLICE data. We augmented the variable set with another 60 variables, each taking 4 values with randomly and independently assigned probabilities. The rest of the experimental conditions (training set, test set and number of random restarts) were identical to those of the first SPLICE experiment. We fit a set of models with $m=1$, a small $\beta=0.1$ and no smoothing. The structure of the new models, in the form of a cumulative adjacency matrix, is shown in Figure 19. We see that the structure over the original 61 variables is unchanged and stable; the 60 noise variables connect to the original variables and to each other in a random, roughly uniform pattern. As expected from the structure, the classification performance of the new trees is not affected by the newly introduced variables: in fact, the average accuracy of the trees over 121 variables is 95.8%, which is 0.1% higher than the accuracy of the original trees.
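To make the cancellation argument for the naive Bayes classifier explicit, recall the standard naive Bayes posterior (the labeling of the inputs as $v_1,\dots,v_n$ is ours):

\begin{displaymath}
P_{c\vert v_1 \ldots v_n} \;\propto\; P_c \prod_{u=1}^{n} P_{v_u\vert c} .
\end{displaymath}

If a variable $v$ is irrelevant, then $P_{v\vert c}$ is approximately the same for every value of $c$; its factor is then (nearly) a constant across classes, so it disappears when the posterior is normalized and leaves the predicted class essentially unchanged.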
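As an illustration, the following is a minimal sketch of the noise-variable augmentation described above, assuming the training set is stored as a NumPy array of categorical codes with one row per example. The function name, the array name splice_train, and the use of a Dirichlet draw to choose each noise variable's distribution are our own illustrative choices, not details taken from the original experiment.

import numpy as np

rng = np.random.default_rng(0)

def add_noise_variables(data, n_noise=60, n_values=4):
    # Append n_noise irrelevant variables to the data matrix. Each noise
    # variable takes n_values values drawn i.i.d. from its own randomly
    # chosen distribution, independently of the class and of the original
    # variables. (Illustrative sketch; not the authors' code.)
    n_examples = data.shape[0]
    noise_columns = []
    for _ in range(n_noise):
        p = rng.dirichlet(np.ones(n_values))   # a random distribution over the 4 values
        noise_columns.append(rng.choice(n_values, size=n_examples, p=p))
    return np.column_stack([data] + noise_columns)

# splice_train: 61 original columns (class plus 60 variables); the augmented
# array would have 121 columns, as in the experiment.
# augmented_train = add_noise_variables(splice_train)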