N-Grams and Smoothing
Core N-gram and Smoothing Terms
N-gram:
A contiguous sequence of N items (typically words) from a given text or speech sample.
Bigram:
An N-gram where N=2; a sequence of two words.
Trigram:
An N-gram where N=3; a sequence of three words.
Unigram:
An N-gram where N=1; a single word.
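For example, a minimal Python sketch relating the four terms above; the ngrams helper and the toy sentence are illustrative, not from any particular library:

```python
# Slide a window of size n over a token list: unigrams (n=1),
# bigrams (n=2), trigrams (n=3). Toy sentence is an assumption.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams: ('the',), ('cat',), ...
print(ngrams(tokens, 2))  # bigrams:  ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), ...
```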
Smoothing:
A technique used in language modeling to adjust probability estimates for unseen N-grams, preventing zero probabilities.
Add-One (Laplace) Smoothing:
A simple smoothing method that adds one to every N-gram count (and the vocabulary size V to each denominator) before calculating probabilities, so no N-gram receives zero probability.
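For bigrams, the add-one estimate is P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V). A minimal sketch, assuming a toy corpus:

```python
from collections import Counter

# Add-one (Laplace) smoothing for bigrams on a toy corpus (an assumption).
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size: 5 word types

def p_laplace(w1, w2):
    # P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_laplace("the", "cat"))  # seen bigram:   (1 + 1) / (2 + 5) ≈ 0.286
print(p_laplace("the", "sat"))  # unseen bigram: (0 + 1) / (2 + 5) ≈ 0.143
```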
Zero Probability Problem:
The issue that arises when an N-gram does not appear in the training data, resulting in a probability of zero.
Vocabulary Size (V):
The number of distinct word types in the corpus, denoted V; it appears in the denominator of add-one smoothing.
Maximum Likelihood Estimate (MLE):
A method of estimating probabilities based on observed frequencies in the training data, without smoothing.
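A sketch of the MLE bigram estimate on the same toy corpus as above, showing how it produces the zero probability problem:

```python
from collections import Counter

# MLE bigram estimate: pure relative frequency, no smoothing.
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2):
    # P(w2 | w1) = C(w1, w2) / C(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("the", "cat"))  # 1 / 2 = 0.5 (seen)
print(p_mle("the", "sat"))  # 0 / 2 = 0.0 (unseen: the zero probability problem)
```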
Sparse Data:
A situation where many possible N-grams are not observed in the training corpus.
Analysis Terms
Language Model:
A statistical model that assigns probabilities to sequences of words.
Probability Distribution:
A function that describes the likelihood of occurrence of different possible outcomes.
Corpus:
A large and structured set of texts used for statistical analysis and hypothesis testing.
Token:
An individual occurrence of a word or punctuation mark in running text; "the cat sat on the mat" contains six tokens.
Type:
A unique word in the corpus, regardless of how many times it appears.
Context:
The surrounding words or items that influence the probability of a given word in an N-gram model.
Backoff:
A smoothing technique where lower-order N-gram probabilities are used when higher-order N-gram counts are zero.
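A simplified, unnormalized sketch in the spirit of "stupid backoff" (full methods such as Katz backoff also discount and renormalize); the alpha weight is an illustrative assumption:

```python
from collections import Counter

# Backoff sketch: use the bigram estimate when its count is nonzero,
# otherwise fall back to a scaled unigram estimate.
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def score_backoff(w1, w2, alpha=0.4):
    if bigram_counts[(w1, w2)] > 0:
        return bigram_counts[(w1, w2)] / unigram_counts[w1]
    return alpha * unigram_counts[w2] / N  # back off to the unigram

print(score_backoff("the", "cat"))  # bigram seen: 0.5
print(score_backoff("the", "sat"))  # unseen: 0.4 * (1 / 6) ≈ 0.067
```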
Interpolation:
A smoothing technique that mixes weighted probabilities from N-gram models of different orders, e.g., a weighted average of bigram and unigram estimates.
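A linear interpolation sketch; the lambda weight is an arbitrary illustrative value, and in practice the weights are tuned on held-out data:

```python
from collections import Counter

# Linear interpolation: lam * P(bigram) + (1 - lam) * P(unigram),
# with weights summing to one. Toy corpus is an assumption.
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_interp(w1, w2, lam=0.7):
    p_bigram = bigram_counts[(w1, w2)] / unigram_counts[w1]
    p_unigram = unigram_counts[w2] / N
    return lam * p_bigram + (1 - lam) * p_unigram

print(p_interp("the", "cat"))  # 0.7 * 0.5 + 0.3 * (1/6) = 0.40
print(p_interp("the", "sat"))  # unseen bigram still gets 0.3 * (1/6) = 0.05
```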
Perplexity:
A measurement of how well a probability model predicts a sample; lower perplexity indicates a better model.
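Perplexity is PP(W) = P(w1 ... wN)^(-1/N). A sketch for an add-one bigram model, evaluated here on the training sentence itself purely for illustration:

```python
import math
from collections import Counter

# Perplexity of a Laplace-smoothed bigram model on a toy corpus (an assumption).
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)

def p_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

def perplexity(tokens):
    # Geometric-mean inverse probability over the bigram transitions.
    log_prob = sum(math.log(p_laplace(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / (len(tokens) - 1))

print(perplexity("the cat sat on the mat".split()))
```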
Training Data:
The set of text used to build and estimate the parameters of a language model.
Test Data:
The set of text used to evaluate the performance of a language model.
Practical Terms
Sentence Probability:
The probability assigned to a sequence of words by a language model.
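Under a bigram model this is computed by the chain rule as a product of conditional probabilities. A sketch on the toy corpus; real models usually add sentence-start/end markers, omitted here for brevity:

```python
from collections import Counter

# Chain rule for a bigram model: P(w1) * P(w2|w1) * P(w3|w2) * ...
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)

def p_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

def sentence_probability(tokens):
    prob = unigram_counts[tokens[0]] / len(corpus)  # P(w1), no start symbol here
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= p_laplace(w1, w2)
    return prob

print(sentence_probability("the cat sat".split()))  # ≈ 0.032
```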
Out-of-Vocabulary (OOV) Word:
A word that does not appear in the training corpus.
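One common treatment, sketched here with an assumed toy vocabulary, is to map OOV words to a special <UNK> token before scoring:

```python
# Replace words outside the training vocabulary with <UNK> (a common
# convention; the vocabulary below is an illustrative assumption).
vocab = {"the", "cat", "sat", "on", "mat"}

def map_oov(tokens):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(map_oov("the dog sat".split()))  # ['the', '<UNK>', 'sat']
```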
Frequency Count:
The number of times a particular N-gram appears in the corpus.
Conditional Probability:
The probability of a word given its preceding context in an N-gram model.
Normalization:
Adjusting probability values so that they sum to one.
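A quick check that add-one smoothing yields a properly normalized distribution: for a fixed context, the smoothed conditional probabilities over the whole vocabulary should sum to one.

```python
from collections import Counter

# Normalization check for Laplace-smoothed bigrams on the toy corpus.
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab = list(unigram_counts)
V = len(vocab)

def p_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(sum(p_laplace("the", w) for w in vocab))  # ≈ 1.0
```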