Language models are an essential element of natural language processing, central to tasks ranging from spellchecking to machine translation. Given an arbitrary piece of text, a language model determines whether that text belongs to a given language.
We can give a concrete example with a probabilistic language model, a specific construction that uses probabilities to estimate how likely it is that a given string belongs to a language. Consider a probabilistic English language model \( P_E \). We would expect the probability
\[P_E(\text{I went to the store})\]
to be quite high, since we can confirm this is valid English. On the other hand, we expect the probabilities
\[P_E(\text{store went to I the}), P_E(\text{Ich habe eine Katze})\]
to be very low, since these fragments do not constitute proper English text.
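To make this concrete, here is a toy maximum-likelihood bigram model in Python (the two-sentence training corpus is an assumption for illustration). The well-formed sentence gets a reasonable probability, while the scrambled one gets zero because it contains bigrams never seen in training:

```python
from collections import Counter

# Train a maximum-likelihood bigram model on a toy two-sentence corpus
# (the corpus is an assumption for illustration).
corpus = "<s> i went to the store </s> <s> i went to the park </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_mle(sentence):
    """Score a sentence as a product of bigram probabilities c(uv) / c(u).

    Assumes every word type in `sentence` was seen in training.
    """
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_mle("i went to the store"))  # 0.5: every bigram was observed in training
print(p_mle("store went to i the"))  # 0.0: unseen bigrams zero out the product
```

That hard zero for any sentence containing an unseen bigram is precisely the problem that smoothing methods address.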
I don’t aim to cover the entirety of language models at the moment — that would be an ambitious task for a single blog post. If you haven’t encountered language models or n-grams before, I recommend the following resources:
- “Language model” on Wikipedia
- Chapter 4 of Jurafsky and Martin’s Speech and Language Processing
- Chapter 7 of Statistical Machine Translation (see summary slides online)
I’d like to jump ahead to a trickier subject within language modeling known as Kneser-Ney smoothing. This smoothing method is most commonly applied in an interpolated form,^{1} and this is the form that I’ll present today.
Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models. The formula for absolute-discounting smoothing as applied to a bigram language model is presented below:
\[P_{abs}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha\; p_{abs}(w_i)\]
Here \(\delta\) refers to a fixed discount value, and \(\alpha\) is a normalizing constant. The details of this smoothing are covered in Chen and Goodman (1999).
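As a concrete sketch, here is absolute-discounting interpolation for a bigram model in Python. The tiny corpus and the discount \(\delta = 0.75\) are assumptions for illustration (Chen and Goodman discuss how to estimate the discount from data), and \(\alpha\) is computed so that the distribution still sums to one:

```python
from collections import Counter

# Toy training text (an assumption for illustration).
tokens = "<s> san francisco is in california </s> <s> i wear reading glasses </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
total_tokens = sum(unigrams.values())
delta = 0.75  # fixed discount; Chen and Goodman (1999) discuss estimating it

def p_abs(prev, word):
    """Absolute-discounting interpolation for a bigram model."""
    # First term: discounted relative bigram count.
    higher = max(bigrams[(prev, word)] - delta, 0) / unigrams[prev]
    # alpha carries the discounted mass over to the unigram model,
    # so the distribution over the vocabulary still sums to one.
    distinct_continuations = len([b for b in bigrams if b[0] == prev])
    alpha = delta * distinct_continuations / unigrams[prev]
    # Second term: plain unigram model.
    lower = unigrams[word] / total_tokens
    return higher + alpha * lower
```

With this choice of \(\alpha\), summing `p_abs(prev, w)` over the whole vocabulary gives one for any context seen in training.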
The essence of Kneser-Ney lies in the clever observation that we can take advantage of this interpolation as a sort of backoff model. When the first term (in this case, the discounted relative bigram count) is near zero, the second term (the lower-order model) carries more weight. Conversely, when the higher-order model matches strongly, the second, lower-order term has little weight.
The Kneser-Ney design retains the first term of absolute-discounting interpolation, but rewrites the second term to take advantage of this relationship. Whereas absolute-discounting interpolation in a bigram model would simply default to a unigram model in the second term, Kneser-Ney depends upon the idea of a continuation probability associated with each unigram.
This probability for a given token \(w_i\) is proportional to the number of bigrams which it completes:
\[P_{\text{continuation}}(w_i) \propto \left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|\]
This quantity is normalized by dividing by the total number of bigram types (note that \(j\) is a free variable):
\[P_{\text{continuation}}(w_i) = \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \} \right|}\]
The common example used to demonstrate the efficacy of Kneser-Ney is the phrase San Francisco. Suppose this phrase is abundant in a given training corpus. Then the unigram probability of Francisco will also be high. If we unwisely use something like absolute-discounting interpolation in a context where our bigram model is weak, the unigram model portion may take over and lead to some strange results.
Dan Jurafsky gives the following example context:
I can’t see without my reading _____.
A fluent English speaker reading this sentence knows that the word glasses should fill in the blank. But since San Francisco is a common term, absolute-discounting interpolation might declare that Francisco is a better fit: \(P_{abs}(\text{Francisco}) > P_{abs}(\text{glasses})\).
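Continuation counts repair exactly this failure. Here is a small Python sketch (the toy corpus, in which "san francisco" is frequent but "francisco" only ever follows "san", is an assumption for illustration):

```python
# Toy corpus where "francisco" is frequent but only ever follows "san"
# (an assumption for illustration).
text = ("san francisco " * 4 + "my reading glasses broke . i bought new glasses .").split()
bigram_types = set(zip(text, text[1:]))  # bigram *types*, ignoring counts

def p_continuation(word):
    """Fraction of bigram types that `word` completes."""
    contexts = {prev for prev, cur in bigram_types if cur == word}
    return len(contexts) / len(bigram_types)

# "francisco" is twice as frequent as "glasses" here, yet "glasses"
# completes more distinct contexts, so its continuation probability is higher.
print(p_continuation("francisco"))  # 1/12: it only ever follows "san"
print(p_continuation("glasses"))    # 2/12: it follows both "reading" and "new"
```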
Kneser-Ney fixes this problem by asking a slightly harder question of our lower-order model. Whereas the unigram model tells us only how likely a word \(w_i\) is to appear, Kneser-Ney’s second term determines how likely a word \(w_i\) is to appear in an unfamiliar bigram context.
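Combining that harder question with the discounted bigram term gives a complete interpolated bigram model. Here is a minimal Python sketch; the toy corpus and \(\delta = 0.75\) are assumptions, and \(\lambda\) is chosen so the distribution normalizes:

```python
from collections import Counter

# Toy training text (an assumption for illustration).
tokens = ("san francisco " * 4 + "my reading glasses broke . i bought new glasses .").split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
bigram_types = set(bigram_counts)
delta = 0.75  # fixed discount

def p_kn(prev, word):
    """Interpolated Kneser-Ney estimate of P(word | prev) for a bigram model."""
    # First term: discounted relative bigram count.
    higher = max(bigram_counts[(prev, word)] - delta, 0) / unigram_counts[prev]
    # lambda(prev) spreads the discounted mass over the continuation distribution.
    lam = delta * len({w for p, w in bigram_types if p == prev}) / unigram_counts[prev]
    # Second term: continuation probability of `word`.
    cont = len({p for p, w in bigram_types if w == word}) / len(bigram_types)
    return higher + lam * cont

# In the "reading ___" context, glasses now wins comfortably:
print(p_kn("reading", "glasses"))    # 0.375
print(p_kn("reading", "francisco"))  # 0.0625
```

For any context seen in training, `p_kn(prev, w)` summed over the vocabulary comes to one, as a proper conditional distribution should.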
The full Kneser-Ney equation follows:
\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \} \right|}\]
where \(\lambda\) is a normalizing constant:
\[\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{ w' : c(w_{i-1}, w') > 0 \} \right|\]
Note that the denominator of the first term can be simplified to a unigram count: \(\sum_{w'} c(w_{i-1} w') = c(w_{i-1})\). Here is the final interpolated Kneser-Ney smoothed bigram model, in all its glory:
\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \} \right|}\]

Further reading
If you enjoyed this post, here is some further reading on Kneser-Ney and other smoothing methods:
- Bill MacCartney’s smoothing tutorial (very accessible)
- Chen and Goodman (1999)
- Section 4.9.1 of Jurafsky and Martin’s Speech and Language Processing

For the canonical definition of interpolated KneserNey smoothing, see S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech and Language, vol. 13, no. 4, pp. 359–394, 1999. ↩