## Concentration Bounds for Unigram Language Models

** Evgeny Drukh, Yishay Mansour**; 6(Aug):1231--1264, 2005.

### Abstract

We show several high-probability concentration bounds for learning
unigram language models. One interesting quantity is the
probability of all words appearing exactly *k* times in a sample
of size *m*. A standard estimator for this quantity is the
Good-Turing estimator. The existing analysis on its error shows a
high-probability bound of approximately *O(k / m ^{1/2})*.
We improve its dependency on

*k*to

*O(k*. We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately

^{1/4}/ m^{1/2}+ k / m)*O( 1 / k + k*. We derive a combined estimator, which has an error of approximately

^{1/2}/ m)*O(m*, for any

^{-2/5})*k*.

A standard measure for the quality of a learning algorithm is its
expected per-word log-loss. The leave-one-out method can be used
for estimating the log-loss of the unigram model. We show that its
error has a high-probability bound of approximately *O(1 / m ^{1/2})*,
for any underlying distribution.

We also bound the log-loss a priori, as a function of various parameters of the distribution.

[abs][pdf]