Next: Categorization of Adjuncts Up: The Categorization Process Previous: The Categorization Process

Categorization of Nuclei

The context used to find the nuclei relies on a simple observation: a structure requires at least one nucleus$^{4}$. The pattern used to determine nuclei is:
$\displaystyle f\raisebox{-5pt}{\footnotesize char}(X)= \frac{\sum_{C}{[X]}}{\sum_{C}{X}}$     (1)

$\sum_{C}{P}$ is the number of occurrences of the pattern $P$ in the corpus $C$. The brackets delimit the structure (an NP, for example), so $[$ PRP $]$ denotes a structure composed of a single word tagged PRP. Elements that occur alone in a structure are treated as nuclei, since a structure requires a nucleus. For example, the tags PRP (pronoun) and NNP (proper noun) can each make up a structure on their own (99% and 48% of their occurrences, respectively, appear alone in an NP), whereas the tag JJ rarely appears alone (0.9%). An empirically tuned threshold is used to discriminate nuclei from adjuncts, from which we deduce that PRP and NNP are nuclei and JJ is not.

This criterion does not identify all the nuclei, however. Some nuclei often appear with adjuncts: an English noun (NN) often$^{5}$ occurs with a determiner or an adjective and thus appears alone only 13% of the time. Used on its own, this characteristic yields a continuum of values in which automatically setting a threshold between adjuncts and nuclei is problematic and depends on the structure.

To solve this problem, the categorization of nuclei is decomposed into two steps. First we identify characteristic adjuncts, i.e. adjuncts which cannot appear alone since they depend on a nucleus. The function $f\raisebox{-5pt}{\footnotesize char}$ gives very small values for such adjuncts: if the value of an element is lower than a given threshold ( $\theta\raisebox{-5pt}{\footnotesize char}=0.05$), it is categorized as a characteristic adjunct. The same function is thus used for determining both nuclei and adjuncts: in the first case its value has to be greater than a given threshold, in the second case lower than a second threshold. For example, in the training corpus the pattern $[ JJ ]$ occurs 99 times and the pattern $JJ$ (including the pattern $[ JJ ]$) occurs 11097 times, so $f\raisebox{-5pt}{\footnotesize char}(JJ)=99/11097\approx 0.009$, a value low enough to consider JJ a characteristic adjunct. The list provided by $f\raisebox{-5pt}{\footnotesize char}$ for English NP is:
$A\raisebox{-5pt}{\footnotesize char}=\{DT, PRP\$, POS, JJ, JJR, JJS, \textrm{``}, \textrm{''}\}$
These elements can be left or right adjuncts. Not all adjuncts are identified at this stage, but the list is sufficient to identify the nuclei.
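The first step can be sketched in Python as follows. This is only an illustrative reading of equation (1), not the authors' implementation: the function names `f_char` and `characteristic_adjuncts`, the representation of a structure as a list of POS tags, and the toy corpus are all assumptions made for the example. Only the threshold 0.05 comes from the text.

```python
from collections import Counter


def f_char(structures):
    """Return {tag: f_char(tag)} for a corpus given as a list of tag sequences.

    Each structure (e.g. one NP) is assumed to be a list of POS tags.
    """
    alone = Counter()  # occurrences of the pattern [X]: X is the whole structure
    total = Counter()  # occurrences of the pattern X, including [X]
    for tags in structures:
        total.update(tags)
        if len(tags) == 1:
            alone[tags[0]] += 1
    return {tag: alone[tag] / total[tag] for tag in total}


def characteristic_adjuncts(scores, theta_char=0.05):
    """Tags whose f_char value is below the threshold theta_char."""
    return {tag for tag, score in scores.items() if score < theta_char}


# Toy corpus of NP tag sequences, invented purely for illustration.
toy_nps = [
    ["PRP"],
    ["DT", "NN"],
    ["DT", "JJ", "NN"],
    ["NNP"],
    ["NN"],
    ["JJ", "NNS"],
    ["NNS"],
    ["NNP", "NNP"],
]

scores = f_char(toy_nps)
adjuncts = characteristic_adjuncts(scores)
print(sorted(adjuncts))
```

On this toy corpus DT and JJ never occur alone, so they fall below the threshold, while NN scores only 1/3 even though it is a nucleus. This mirrors the point made above that the same measure does not reliably single out nuclei that usually occur with adjuncts.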

The second step introduces these elements into a new pattern used by the function $f\raisebox{-5pt}{\footnotesize nu}$. This pattern matches elements surrounded by characteristic adjuncts, and thus matches nuclei that often appear with adjuncts. Since a sequence of adjuncts (like a single adjunct) cannot make up a complete structure on its own, $X$ only matches elements which correspond to a nucleus.

$\displaystyle f\raisebox{-5pt}{\footnotesize nu}(X)=\frac{\sum_{C}{[\ A\raisebox{-5pt}{\footnotesize char}*\ X\ A\raisebox{-5pt}{\footnotesize char}*\ ]}}{\sum_{C}{X}}$     (2)

The function $f\raisebox{-5pt}{\footnotesize nu}$ discriminates well between nuclei and adjuncts: it gives very low values for adjuncts and very high values for nuclei (Table 7).
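A minimal Python sketch of equation (2) is given here for illustration. It assumes one particular reading of the pattern: a position in a structure counts as a match for $X$ when every other element of that structure is a characteristic adjunct. The function name `f_nu`, the list-of-tags representation and the toy corpus are hypothetical. The set `A_CHAR` is the list of characteristic adjuncts for English NP given above.

```python
from collections import Counter

# Characteristic adjuncts for English NP, as listed in the text.
A_CHAR = {"DT", "PRP$", "POS", "JJ", "JJR", "JJS", "``", "''"}


def f_nu(structures, a_char=A_CHAR):
    """Return {tag: f_nu(tag)} for a corpus given as a list of tag sequences."""
    matched = Counter()  # occurrences of the pattern [A_char* X A_char*]
    total = Counter()    # occurrences of the pattern X
    for tags in structures:
        total.update(tags)
        # Positions holding an element that is not a characteristic adjunct.
        outside = [i for i, tag in enumerate(tags) if tag not in a_char]
        if len(outside) > 1:
            continue  # two or more such elements: no position can match X
        # With exactly one such element only that position can be X;
        # with none, every position is surrounded by adjuncts only.
        positions = outside if outside else range(len(tags))
        for i in positions:
            matched[tags[i]] += 1
    return {tag: matched[tag] / total[tag] for tag in total}


# Toy corpus of NP tag sequences, invented purely for illustration.
toy_nps = [
    ["DT", "NN"],
    ["DT", "JJ", "NN"],
    ["PRP"],
    ["NN", "NN"],
    ["JJ"],
    ["DT", "JJ", "NNS"],
]

nu_scores = f_nu(toy_nps)
print(nu_scores)
```

Under this reading, a noun preceded by a determiner and an adjective now counts towards the numerator, which is why nouns score much higher with this function than with $f\raisebox{-5pt}{\footnotesize char}$. Structures made up only of characteristic adjuncts still contribute a small amount to the adjuncts' own scores.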


Table 7: Detection of some nuclei of the English NP (nouns and pronouns).
x Freq(x) $f\raisebox{-3pt}{\small nu}(x)$ nucleus
POS 1759 0.00 no
PRP$ 1876 0.01 no
JJ 11097 0.02 no
DT 18097 0.03 no
RB 919 0.06 no
NNP 11046 0.74 yes
NN 21240 0.87 yes
NNPS 164 0.93 yes
NNS 7774 0.95 yes
WP 527 0.97 yes
PRP 3800 0.99 yes
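
The separation visible in Table 7 can be checked with a few lines of Python. The values below are copied from the table. The helper name `separating_gap` and the cut-off 0.5 are assumptions made for this illustration, since the text does not state the threshold applied to $f\raisebox{-5pt}{\footnotesize nu}$.

```python
# (f_nu value, is_nucleus) for each tag, copied from Table 7.
TABLE_7 = {
    "POS": (0.00, False),
    "PRP$": (0.01, False),
    "JJ": (0.02, False),
    "DT": (0.03, False),
    "RB": (0.06, False),
    "NNP": (0.74, True),
    "NN": (0.87, True),
    "NNPS": (0.93, True),
    "NNS": (0.95, True),
    "WP": (0.97, True),
    "PRP": (0.99, True),
}


def separating_gap(table):
    """Return (highest value among non-nuclei, lowest value among nuclei)."""
    highest_adjunct = max(v for v, is_nucleus in table.values() if not is_nucleus)
    lowest_nucleus = min(v for v, is_nucleus in table.values() if is_nucleus)
    return highest_adjunct, lowest_nucleus


gap = separating_gap(TABLE_7)

# Hypothetical cut-off: any value strictly inside the gap gives the same labels.
ASSUMED_THRESHOLD = 0.5
predicted = {tag: value > ASSUMED_THRESHOLD for tag, (value, _) in TABLE_7.items()}
print(gap, predicted)
```

Any threshold lying between the two values returned by `separating_gap` reproduces the last column of the table, so on these tags the exact choice of cut-off is not critical.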



Hammerton J. 2002-03-13