

Refinement

The main difference between ALLiS and classical inductive systems lies in its explicit handling of noise through refinement. The training data is a portion of the Wall Street Journal corpus tagged with the Brill tagger. This tagger has an average error rate of 5%, and errors of bracketing are also introduced during the extraction of the data from the original corpus (chunks are not explicitly bracketed in the original corpus, and non-trivial processing is required to extract them). Moreover, since some very frequent words (such as the and to) are almost always correctly tagged, other words must be tagged with an error rate far greater than 5% (especially nouns and verbs). It is therefore sensible to consider that, for some rules, expecting an accuracy greater than 0.80 or 0.90 is not reasonable. The second reason for introducing this refinement mechanism is linked to a property of natural languages: they also contain many sub-regularities and (pockets of) exceptions (see [Excep99]). These two points (noise and exceptions) lead us not to expect to find general rules (rules with a high coverage) with an accuracy equal to 1.

Often, systems consider only noiseless data and try to learn ``perfect'' rules (with an accuracy of 1). In order to deal with noise, some integrate mechanisms for stopping the expansion (see [Pazzani and Kibler(1992)]) or pruning mechanisms (for example, reduced error pruning in [Brunk and Pazzani(1991)]). As a stopping mechanism, FOIL uses an encoded-length heuristic indicating that the rule is becoming too complex. Another simple approach is to set a threshold lower than 1, as in [Soderland(1999)]. This is the solution we have chosen to adopt. But what about the few percent of cases that are wrongly categorized? Should we ignore them? The refinement process tries to learn these cases, treating them as exceptions to the rule just learned. Let us take a simple example in order to illustrate this mechanism. Consider the rule:
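As a minimal sketch, this threshold-based acceptance can be written as follows. This is an illustration in Python, not ALLiS code; accuracy and accept_rule are hypothetical names, and only the idea of a threshold $\theta $ below 1 comes from the text.

```python
# Hedged sketch of threshold-based rule acceptance. Only the idea of a
# threshold theta < 1 comes from the text; the function names are invented.

def accuracy(correct: int, matched: int) -> float:
    """Fraction of contexts matched by a rule that are correctly categorized."""
    return correct / matched if matched else 0.0

def accept_rule(correct: int, matched: int, theta: float = 0.8) -> bool:
    """Keep a rule when its accuracy reaches theta, instead of demanding
    'perfect' rules with accuracy 1."""
    return accuracy(correct, matched) >= theta

# A rule with 29027 matches at accuracy ~0.98 easily passes theta = 0.8.
print(accept_rule(correct=28446, matched=29027))  # True
```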

<RULE NUM='1' ACC='0.98' FREQ='29027'>
<N C='NN'/>
</RULE>
Its accuracy is 0.98. Let the threshold $\theta $ be 0.8. The accuracy is high enough, so the rule is learned, but it may generate 2% errors. In order to fix them, we learn rules that correct them, relying on the fact that some kinds of errors are regular enough to be explained. The starting point is the rule just learned: to find its exceptions, we simply switch the positive and negative examples and apply the same algorithm (see line 6 of Algorithm 1). The goal is now to categorize elements matching the context described by rule 1 which do not belong to NP. The positive examples of rule 1 thus become the negative examples of its exceptions, and the new positive examples are the contexts where nodes with the feature C='NN' do not have the feature S='I-NP'. The learning procedure is called recursively with this inversion of the positive and negative data (Algorithm 1). The following rules are then learned:
<RULE EXCEP='1' NUM='3' ACC='1.00' FREQ='17'>
<N W='have' C='NN'/>
</RULE>

<RULE EXCEP='1' NUM='98' ACC='0.90' FREQ='9'>
<N C='IN' LEFT='1'/>
<N W='order' C='NN'/>
<N C='TO' RIGHT='1'/>
</RULE>
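The swap of positive and negative examples can be sketched as follows. This is a toy Python illustration rather than the ALLiS learner itself: learn_word_rules is a one-feature stand-in, and the small data set (with mistagged occurrences of have) is invented for the example.

```python
# Toy data: (word, tag, in_np) contexts. The general NN -> I-NP rule is
# mostly right, but a few mistagged occurrences of "have" are exceptions.
data = [("company", "NN", True)] * 20 + [("have", "NN", False)] * 3

def learn_word_rules(positives, negatives, theta=0.8):
    """Toy stand-in for the learner: one candidate rule per word, kept
    when its accuracy pos / (pos + neg) reaches theta."""
    kept = []
    for w in {word for (word, _, _) in positives}:
        pos = sum(1 for (word, _, _) in positives if word == w)
        neg = sum(1 for (word, _, _) in negatives if word == w)
        if pos / (pos + neg) >= theta:
            kept.append(w)
    return kept

# The positives of the general rule are the in-NP contexts.
positives = [c for c in data if c[2]]
negatives = [c for c in data if not c[2]]

# Refinement: swap the two example sets and re-apply the same learner,
# yielding exception rules such as W='have'.
exceptions = learn_word_rules(positives=negatives, negatives=positives)
print(exceptions)  # ['have']
```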
Such exceptions are called refinement rules. The feature EXCEP gives the identifier of the rule to which it is an exception. The two exceptions above have different explanations. The first refinement rule illustrates well the interest of refinement for noisy data: in the training data, some occurrences of the word have are wrongly tagged NN (singular noun) instead of verb (recall that this tagging is done automatically).

The second rule illustrates another kind of ``noise'', or more accurately, a linguistic exception in the encoding scheme. The guidelines of the Penn Treebank, from which our data is extracted, specify that in the context in order to, the word order must be tagged NN but does not belong to an NP. Other exceptions similar to this one can be found in the guidelines (subject to, for instance). These exceptions, which are not noise since they are linguistically motivated, can easily be encapsulated thanks to refinement.

ALLiS can also learn exceptions to exceptions (and so on). This happens rarely when $\theta $ is high enough, but frequently otherwise. The next rule could be an exception to rule 98, but it is not kept since its frequency is too small.

<RULE EXCEP='98' NUM='99' ACC='1.00' FREQ='1'>
<N W='of' C='IN' LEFT='1'/>
<N W='order' C='NN'/>
<N C='TO' RIGHT='1'/>
</RULE>
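The cutoff that rejects rule 99 can be sketched as a simple support filter. The minimum frequency of 2 below is an assumption for illustration; the text only says that a frequency of 1 is too small.

```python
# Hedged sketch of the frequency cutoff: a candidate refinement rule must
# be both accurate and frequent enough. min_freq = 2 is an assumption.

def keep_exception(freq: int, acc: float, theta: float = 0.8,
                   min_freq: int = 2) -> bool:
    """Reject candidate exceptions with too little support, even when
    their accuracy is 1."""
    return acc >= theta and freq >= min_freq

print(keep_exception(freq=9, acc=0.90))   # True:  rule 98 is kept
print(keep_exception(freq=1, acc=1.00))   # False: rule 99's frequency is too small
```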

Each refinement rule is in fact a more specific rule describing a context in which the ``mother'' rule (the one referred to by the feature EXCEP) is no longer valid.
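One way to read this, sketched below under the assumption (not stated explicitly in this section) that a matching exception simply blocks its mother rule, is a short recursion in which exceptions to exceptions are handled by the same test. The predicates are toy stand-ins for the XML rule contexts.

```python
# Hedged sketch: a rule covers a context only if it matches and no
# refinement rule (exception) covers that context in turn. Exceptions to
# exceptions are handled by the same recursion.

def covered(context, rule):
    """True when `rule` applies to `context` and is not blocked by one of
    its refinement rules."""
    if not rule["matches"](context):
        return False
    return not any(covered(context, exc) for exc in rule["exceptions"])

# Toy stand-ins: the exception blocks the mother rule on "in order to".
rule_98 = {"matches": lambda c: c == ("in", "order", "to"), "exceptions": []}
rule_1 = {"matches": lambda c: "order" in c, "exceptions": [rule_98]}

print(covered(("in", "order", "to"), rule_1))        # False: the exception fires
print(covered(("the", "order", "arrived"), rule_1))  # True: no exception matches
```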


Hammerton J. 2002-03-13