Home Page




Editorial Board

Special Issues

Open Source Software

Proceedings (PMLR)

Data (DMLR)

Transactions (TMLR)




Frequently Asked Questions

Contact Us

RSS Feed

Concentration Bounds for Unigram Language Models

Evgeny Drukh, Yishay Mansour; 6(42):1231−1264, 2005.


We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on its error shows a high-probability bound of approximately O(k / m1/2). We improve its dependency on k to O(k1/4 / m1/2 + k / m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O( 1 / k + k1/2 / m). We derive a combined estimator, which has an error of approximately O(m-2/5), for any k.

A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1 / m1/2), for any underlying distribution.

We also bound the log-loss a priori, as a function of various parameters of the distribution.

© JMLR 2005. (edit, beta)