Language models are an essential element of natural language processing, central to tasks ranging from spellchecking to machine translation. Given an arbitrary piece of text, a language model estimates how likely it is that the text belongs to a given language.
We can give a concrete example with a probabilistic language model, a specific construction which assigns probabilities to strings, estimating how likely any given string is to belong to a language. Consider a probabilistic English language model P_E. We would expect the probability
P_E(\text{I went to the store})
to be quite high, since we can confirm this is valid English. On the other hand, we expect the probabilities
P_E(\text{store went to I the}), \qquad P_E(\text{Ich habe eine Katz})
to be very low, since these fragments do not constitute proper English text.
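To make this slightly more tangible, here is a minimal sketch of such a model in Python. It is only a toy unigram estimate over a hypothetical scrap of training text; the corpus, the whitespace tokenization, the function name, and the complete lack of smoothing are all simplifications of my own, not part of any real system:

```python
from collections import Counter

# A tiny hypothetical corpus standing in for real training data.
corpus = "i went to the store . the store was closed .".split()

counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(sentence):
    """Probability of a sentence under a unigram model: the product of
    per-word relative frequencies. Unseen words get probability zero."""
    p = 1.0
    for word in sentence.split():
        p *= counts[word] / total
    return p

print(unigram_prob("i went to the store"))   # small but nonzero
print(unigram_prob("ich habe eine katze"))   # 0.0: no word was ever seen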
I don’t aim to cover the entirety of language models at the moment — that would be an ambitious task for a single blog post. If you haven’t encountered language models or n-grams before, I recommend the following resources:
- “Language model” on Wikipedia
- Chapter 4 of Jurafsky and Martin’s Speech and Language Processing
- Chapter 7 of Statistical Machine Translation (see summary slides online)
I’d like to jump ahead to a trickier subject within language modeling known as Kneser-Ney smoothing. This smoothing method is most commonly applied in an interpolated form,¹ and this is the form that I’ll present today.
Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models. The formula for absolute-discounting smoothing as applied to a bigram language model is presented below:
P_{abs}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha\, P_{abs}(w_i)
Here \delta refers to a fixed discount value, and \alpha is a normalizing constant. The details of this smoothing are covered in Chen and Goodman (1999).
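As a rough illustration, here is what absolute-discounting interpolation could look like in code. This is a sketch under assumptions: the function names are mine, \delta = 0.75 is just a common choice, and I take \alpha to be the probability mass freed by discounting, which is one standard normalization among those discussed in Chen and Goodman (1999):

```python
from collections import Counter

def make_absolute_discounting(tokens, delta=0.75):
    """Sketch of absolute-discounting interpolation for a bigram model."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p_abs(prev, word):
        # sum over w' of c(prev, w'): how often `prev` appears as a left context
        context_count = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
        p_unigram = unigrams[word] / total
        if context_count == 0:
            return p_unigram                     # unseen context: fall back to the unigram model
        discounted = max(bigrams[(prev, word)] - delta, 0) / context_count
        # assumption: alpha is the discounted mass, so the estimate still sums to one
        distinct_follow = len({w2 for (w1, w2) in bigrams if w1 == prev})
        alpha = delta * distinct_follow / context_count
        return discounted + alpha * p_unigram

    return p_abs
```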
The essence of Kneser-Ney is in the clever observation that we can take advantage of this interpolation as a sort of backoff model. When the first term (in this case, the discounted relative bigram count) is near zero, the second term (the lower-order model) carries more weight. Conversely, when the higher-order model matches strongly, the second lower-order term has little weight.
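For a concrete (hypothetical) case, suppose c(w_{i-1} w_i) = 0 and we use the common discount \delta = 0.75. The first term is then \max(0 - 0.75, 0) = 0, and the estimate reduces entirely to \alpha\, P_{abs}(w_i). If instead the bigram were observed hundreds of times, subtracting 0.75 would barely dent the count and the first term would dominate.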
The Kneser-Ney design retains the first term of absolute discounting interpolation, but rewrites the second term to take advantage of this relationship. Whereas absolute discounting interpolation in a bigram model would simply default to a unigram model in the second term, Kneser-Ney depends upon the idea of a continuation probability associated with each unigram.
This probability for a given token w_i is proportional to the number of distinct bigram types which it completes:
P_{\text{continuation}}(w_i) \propto \: \left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|
This quantity is normalized by dividing by the total number of bigram types (note that j is a free variable):
P_{\text{continuation}}(w_i) = \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}
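Computed directly from bigram counts, the continuation probability could look like the following sketch (assuming a pre-tokenized list of tokens; the function name is my own):

```python
from collections import Counter

def continuation_probabilities(tokens):
    """Sketch of the Kneser-Ney continuation probability: for each word, the
    number of distinct bigram types it completes, divided by the total number
    of distinct bigram types in the corpus."""
    bigram_types = set(zip(tokens, tokens[1:]))
    distinct_left_contexts = Counter(w2 for (_, w2) in bigram_types)
    total_types = len(bigram_types)
    return {w: n / total_types for w, n in distinct_left_contexts.items()}
```

Summing these values over the vocabulary gives one, so the result is a proper distribution over words.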
The common example used to demonstrate the efficacy of Kneser-Ney is the phrase San Francisco. Suppose this phrase is abundant in a given training corpus. Then the unigram probability of Francisco will also be high. If we unwisely use something like absolute discounting interpolation in a context where our bigram model is weak, the unigram model portion may take over and lead to some strange results.
Dan Jurafsky gives the following example context:
I can’t see without my reading _____.
A fluent English speaker reading this sentence knows that the word glasses should fill in the blank. But since San Francisco is a common term, absolute-discounting interpolation might declare that Francisco is a better fit: P_{abs}(\text{Francisco}) > P_{abs}(\text{glasses}).
Kneser-Ney fixes this problem by asking a slightly harder question of our lower-order model. Whereas the unigram model simply provides how likely a word w_i is to appear, Kneser-Ney’s second term determines how likely a word w_i is to appear in an unfamiliar bigram context.
Here is Kneser-Ney in full:
P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}

Here \lambda is a normalizing constant:
\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{w' : c(w_{i-1}, w') > 0\} \right|

Note that the denominator of the first term can be simplified to a unigram count. Here is the final interpolated Kneser-Ney smoothed bigram model, in all its glory:
P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \dfrac{\left| \{ w_{i-1} : c(w_{i-1}, w_i) > 0 \} \right|}{\left| \{ w_{j-1} : c(w_{j-1},w_j) > 0\} \right|}
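Putting the pieces together, an interpolated Kneser-Ney bigram model might be sketched as follows. It follows the final equation above, but it is not any particular library's implementation; the function names and the default \delta = 0.75 are illustrative choices of my own:

```python
from collections import Counter

def make_kneser_ney(tokens, delta=0.75):
    """Sketch of an interpolated Kneser-Ney bigram model (bigram case only)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    bigram_types = set(bigrams)
    total_types = len(bigram_types)
    follows = Counter(w1 for (w1, _) in bigram_types)    # distinct words following each context
    precedes = Counter(w2 for (_, w2) in bigram_types)   # distinct contexts preceding each word

    def p_kn(prev, word):
        continuation = precedes[word] / total_types      # continuation probability of `word`
        context_count = unigrams[prev]                   # denominator simplified to a unigram count
        if context_count == 0:
            return continuation                          # unseen context: continuation prob. alone
        discounted = max(bigrams[(prev, word)] - delta, 0) / context_count
        lam = delta * follows[prev] / context_count      # lambda(prev): the discounted mass
        return discounted + lam * continuation

    return p_kn
```

The intended effect on Jurafsky's example is that glasses, which completes many different bigram types, keeps a healthy continuation probability, while Francisco, which essentially only ever completes San Francisco, does not.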
Further reading

If you enjoyed this post, here is some further reading on Kneser-Ney and other smoothing methods:
- Bill MacCartney’s smoothing tutorial (very accessible)
- Chen and Goodman (1999)
- Section 4.9.1 in Jurafsky and Martin’s Speech and Language Processing
¹ For the canonical definition of interpolated Kneser-Ney smoothing, see S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech and Language, vol. 13, no. 4, pp. 359–394, 1999.