Kneser-Ney smoothing explained

Language models are an essential element of natural language processing, central to tasks ranging from spellchecking to machine translation. Given an arbitrary piece of text, a language model determines whether that text belongs to a given language.

We can give a concrete example with a probabilistic language model, a specific construction which uses probabilities to estimate how likely any given string belongs to a language. Consider a probabilistic English language model \( P_E \). We would expect the probability

\[P_E(\text{I went to the store})\]

to be quite high, since we can confirm this is valid English. On the other hand, we expect the probabilities

\[P_E(\text{store went to I the}), P_E(\text{Ich habe eine Katz})\]

to be very low, since these fragments do not constitute proper English text.

I don’t aim to cover the entirety of language models at the moment — that would be an ambitious task for a single blog post. If you haven’t encountered language models or n-grams before, I recommend the following resources:

“Language model” on Wikipedia
Chapter 4 of Jurafsky and Martin’s Speech and Language Processing
Chapter 7 of Statistical Machine Translation (see summary slides online)

I’d like to jump ahead to a trickier subject within language modeling known as Kneser-Ney smoothing. This smoothing method is most commonly applied in an interpolated form,¹ and this is the form that I’ll present today.

Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models. The formula for absolute-discounting smoothing as applied to a bigram language model is presented below:

\[P_{abs}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha\; p_{abs}(w_i)\]

Here \(\delta\) refers to a fixed discount value, and \(\alpha\) is a normalizing constant. The details of this smoothing are covered in Chen and Goodman (1999).

The essence of Kneser-Ney is in the clever observation that we can take advantage of this interpolation as a sort of backoff model. When the first term (in this case, the discounted relative bigram count) is near zero, the second term (the lower-order model) carries more weight. Inversely, when the higher-order model matches strongly, the second lower-order term has little weight.

The Kneser-Ney design retains the first term of absolute discounting interpolation, but rewrites the second term to take advantage of this relationship. Whereas absolute discounting interpolation in a bigram model would simply default to a unigram model in the second term, Kneser-Ney depends upon the idea of a continuation probability associated with each unigram.

This probability for a given token \(w_i\) is proportional to the number of bigrams which it completes:

\[P_{\text{continuation}}(w_i) \propto \: \left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|\]

This quantity is normalized by dividing by the total number of bigram types (note that \(j\) is a free variable):

\[P_{\text{continuation}}(w_i) = \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}\]

The common example used to demonstrate the efficacy of Kneser-Ney is the phrase San Francisco. Suppose this phrase is abundant in a given training corpus. Then the unigram probability of Francisco will also be high. If we unwisely use something like absolute discounting interpolation in a context where our bigram model is weak, the unigram model portion may take over and lead to some strange results.

Dan Jurafsky gives the following example context:

I can’t see without my reading _____.

A fluent English speaker reading this sentence knows that the word glasses should fill in the blank. But since San Francisco is a common term, absolute-discounting interpolation might declare that Francisco is a better fit: \(P_{abs}(\text{Francisco}) > P_{abs}(\text{glasses})\).

Kneser-Ney fixes this problem by asking a slightly harder question of our lower-order model. Whereas the unigram model simply provides how likely a word \(w_i\) is to appear, Kneser-Ney’s second term determines how likely a word \(w_i\) is to appear in an unfamiliar bigram context.

Kneser-Ney in whole follows:

\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}\]

\(\lambda\) is a normalizing constant

\[\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{w' : c(w_{i-1}, w') > 0\} \right|.\]

Note that the denominator of the first term can be simplified to a unigram count. Here is the final interpolated Kneser-Ney smoothed bigram model, in all its glory:

\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}\]

Further reading