N-Grams

Core N-Gram and Language Modeling Terms

N-Gram: A contiguous sequence of N items (typically words or tokens) from a given text. For example, "the cat" is a bigram (N=2).
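
A minimal sketch of N-Gram extraction in Python; the whitespace tokenizer and the function name `ngrams` are illustrative choices, not from any particular library:

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```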

Unigram: An N-Gram where N=1; a single word or token.

Bigram: An N-Gram where N=2; a sequence of two consecutive words or tokens.

Trigram: An N-Gram where N=3; a sequence of three consecutive words or tokens.

Markov Assumption: The simplifying assumption that the probability of a word depends only on the previous (N-1) words, not the entire sentence history.
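
In symbols, the assumption for an N-Gram model is the standard approximation:

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
```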

Language Model: A statistical model that assigns probabilities to sequences of words, often using N-Gram statistics.
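
A minimal bigram language model sketch using maximum-likelihood estimates from raw counts (no smoothing); the names `train_bigram_lm` and `sentence_prob` and the toy corpus are illustrative assumptions:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s>/</s>."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def sentence_prob(sent, unigram_counts, bigram_counts):
    """P(sentence) as a product of MLE bigram probabilities; 0.0 if a bigram is unseen."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigram_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_prob(["the", "cat", "sat"], uni, bi))  # 0.5
```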

Corpus: A large and structured set of texts used for building N-Gram models.

Probability Table (N-Gram Table): A lookup table containing the probabilities or counts of N-Gram sequences in a corpus.

Smoothing: Techniques used to handle zero probabilities for unseen N-Grams by assigning them small nonzero probabilities.
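
A minimal add-one (Laplace) smoothing sketch for bigram probabilities, reusing the `uni` and `bi` counts from the bigram sketch above; `V` is the vocabulary size:

```python
def laplace_bigram_prob(prev, word, unigram_counts, bigram_counts, vocab_size):
    """Add-one smoothed estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# The unseen bigram ("cat", "barked") now gets a small nonzero probability.
V = len(uni)  # vocabulary size, including <s> and </s>
print(laplace_bigram_prob("cat", "barked", uni, bi, V))  # 1 / (1 + 6) ≈ 0.143
```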

Data Sparsity: The problem that arises when many possible N-Grams do not appear in the training corpus, making probability estimation difficult.

Context Window: The number of previous words considered when predicting the next word (N-1 for an N-Gram model).

Start/End Tokens: Special tokens (e.g., <s>, </s>, (eos)) used to mark the beginning and end of sentences in N-Gram modeling.
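
A small padding sketch, assuming `<s>` and `</s>` as the marker strings; an n-gram model needs n-1 start tokens so the first real word has a full context:

```python
def pad(tokens, n):
    """Pad a sentence with n-1 start tokens and one end token for an n-gram model."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad(["the", "cat", "sat"], 3))
# ['<s>', '<s>', 'the', 'cat', 'sat', '</s>']
```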

Chain Rule: A rule in probability theory used to decompose the probability of a sequence into a product of conditional probabilities.
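
Written out, the standard decomposition is:

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```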

Smoothing Techniques: Methods like Laplace smoothing, Good-Turing discounting, and backoff models used to improve N-Gram probability estimates.
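
As one concrete backoff example, a stupid-backoff-style sketch (unnormalized scores, simpler than Katz backoff; the 0.4 discount and function name are illustrative), again reusing `uni` and `bi` from the bigram sketch above:

```python
def backoff_score(prev, word, unigram_counts, bigram_counts, alpha=0.4):
    """Use the bigram estimate when the bigram was seen; otherwise back off to a
    discounted unigram estimate (scores are not guaranteed to sum to 1)."""
    if bigram_counts[(prev, word)] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    total = sum(unigram_counts.values())
    return alpha * unigram_counts[word] / total

print(backoff_score("the", "cat", uni, bi))  # seen bigram: 0.5
print(backoff_score("cat", "dog", uni, bi))  # unseen bigram: 0.4 * 1/10 = 0.04
```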