POS Tagging - Viterbi Decoding
Part-of-Speech (POS) tagging is a fundamental sequence labeling task in Natural Language Processing that assigns grammatical categories to words in context. The Viterbi algorithm provides an elegant dynamic programming solution to find the most probable sequence of POS tags using Hidden Markov Models.
1. Hidden Markov Models for POS Tagging
A Hidden Markov Model for POS tagging consists of:
- Hidden States (S): POS tags {Noun, Verb, Adjective, Determiner, ...}
- Observable Symbols (O): Words in the vocabulary {the, cat, runs, quickly, ...}
- Transition Probabilities (A): P(tag_j | tag_i) - likelihood of tag sequence
- Emission Probabilities (B): P(word | tag) - likelihood of word given tag
- Initial Probabilities (π): P(tag at position 1) - probability that a sentence starts with a given tag
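These five components can be written down directly as Python dictionaries. A minimal toy sketch is shown below; the tag set, vocabulary, and probability values are hypothetical and chosen only for illustration:

```python
# Toy HMM parameters for POS tagging (hypothetical values, for illustration only).
tags = ["Det", "Noun", "Verb"]                       # hidden states S
vocab = ["the", "cat", "runs"]                       # observable symbols O

pi = {"Det": 0.6, "Noun": 0.3, "Verb": 0.1}          # initial probabilities: P(tag at position 1)

trans = {                                            # transition probabilities A: P(tag_j | tag_i)
    "Det":  {"Det": 0.0, "Noun": 0.9, "Verb": 0.1},
    "Noun": {"Det": 0.1, "Noun": 0.3, "Verb": 0.6},
    "Verb": {"Det": 0.4, "Noun": 0.4, "Verb": 0.2},
}

emit = {                                             # emission probabilities B: P(word | tag)
    "Det":  {"the": 1.0, "cat": 0.0, "runs": 0.0},
    "Noun": {"the": 0.0, "cat": 0.9, "runs": 0.1},
    "Verb": {"the": 0.0, "cat": 0.1, "runs": 0.9},
}
```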
2. The Decoding Problem
Given a sequence of words W = w₁, w₂, ..., wₙ and HMM parameters (A, B, π), find the most likely tag sequence T* = t₁, t₂, ..., tₙ such that:
T* = argmax_T P(T | W)
Using Bayes' theorem (the constant P(W) can be dropped), the Markov assumption for tags, and the assumption that each word depends only on its own tag:
T* = argmax_T ∏ᵢ₌₁ⁿ P(wᵢ | tᵢ) × P(tᵢ | tᵢ₋₁)
where P(t₁ | t₀) is read as the initial probability π(t₁).
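To make the objective concrete, here is a deliberately naive sketch that enumerates and scores every possible tag sequence with the product above (the parameter layout matches the toy dictionaries in Section 1). It runs in O(T^N) time and is useful only as a reference for checking a real decoder on very short sentences:

```python
from itertools import product

def brute_force_decode(words, tags, pi, trans, emit):
    """Score every possible tag sequence and return the best one.

    O(T^N) time: only usable as a reference implementation on tiny inputs,
    which is exactly the problem the Viterbi algorithm solves.
    """
    best_seq, best_prob = None, 0.0
    for seq in product(tags, repeat=len(words)):
        # P(T, W) = pi(t1) * b_t1(w1) * prod_i a_{t_{i-1}, t_i} * b_{t_i}(w_i)
        prob = pi[seq[0]] * emit[seq[0]][words[0]]
        for i in range(1, len(words)):
            prob *= trans[seq[i - 1]][seq[i]] * emit[seq[i]][words[i]]
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq, best_prob
```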
3. Viterbi Algorithm: Dynamic Programming Solution
The Viterbi algorithm solves this optimization problem efficiently using dynamic programming principles.
Mathematical Foundation
For each word position j and tag s, we compute:
V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)
Where:
- V[s][j]: Maximum probability of any tag sequence ending in state s at position j
- a_{s',s}: Transition probability from tag s' to tag s
- b_s(wⱼ): Emission probability of word wⱼ given tag s
Algorithm Steps
1. Initialization (j = 1)
V[s][1] = π[s] × b_s(w₁)
Path[s][1] = 0
2. Recursion (j = 2 to N)
For each state s:
V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)
Path[s][j] = argmax_{s'} (V[s'][j-1] × a_{s',s})
3. Termination
P* = max_s V[s][N]
q*_N = argmax_s V[s][N]
4. Backtracking (j = N-1 to 1)
q*_j = Path[q*_{j+1}][j+1]
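These four steps translate almost line-for-line into code. Below is a minimal Python sketch; probabilities are multiplied directly to mirror the hand computation in Section 4, whereas a production implementation would typically work with log probabilities to avoid underflow:

```python
def viterbi(words, tags, pi, trans, emit):
    """Return (best probability, best tag sequence) for the given sentence."""
    n = len(words)
    V = [{} for _ in range(n)]       # V[j][s]: best probability of any path ending in tag s at position j
    back = [{} for _ in range(n)]    # back[j][s]: predecessor tag on that best path

    # 1. Initialization (position 0 here, j = 1 in the text): pi[s] * b_s(w1)
    for s in tags:
        V[0][s] = pi[s] * emit[s][words[0]]
        back[0][s] = None

    # 2. Recursion: maximize over the previous tag s'
    for j in range(1, n):
        for s in tags:
            prev_best = max(tags, key=lambda sp: V[j - 1][sp] * trans[sp][s])
            V[j][s] = V[j - 1][prev_best] * trans[prev_best][s] * emit[s][words[j]]
            back[j][s] = prev_best

    # 3. Termination: pick the best final tag
    last = max(tags, key=lambda s: V[n - 1][s])
    best_prob = V[n - 1][last]

    # 4. Backtracking: follow the back pointers from the end to the start
    path = [last]
    for j in range(n - 1, 0, -1):
        path.append(back[j][path[-1]])
    path.reverse()
    return best_prob, path
```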
4. Example Walkthrough
Consider tagging "Book that flight" with tags {Noun, Verb, Det}:
Probability Matrices
Emission Matrix P(word|tag):
| tag \ word | Book | that | flight |
|------------|------|------|--------|
| Noun       | 0.3  | 0.1  | 0.8    |
| Verb       | 0.7  | 0.0  | 0.1    |
| Det        | 0.0  | 0.9  | 0.0    |
Transition Matrix P(tag_j|tag_i):
| tag_i \ tag_j | Noun | Verb | Det |
|---------------|------|------|-----|
| Noun          | 0.2  | 0.1  | 0.6 |
| Verb          | 0.5  | 0.2  | 0.3 |
| Det           | 0.8  | 0.2  | 0.0 |
Viterbi Table Computation
Assuming uniform initial probabilities π ≈ 0.33 for each tag:
Time t=1 (Book):
- V[Noun][1] = 0.33 × 0.3 ≈ 0.10
- V[Verb][1] = 0.33 × 0.7 ≈ 0.23
- V[Det][1] = 0.33 × 0.0 = 0.00
Time t=2 (that):
- V[Noun][2] = max(0.10×0.2, 0.23×0.5, 0.00×0.8) × 0.1 = 0.0115
- V[Verb][2] = max(0.10×0.1, 0.23×0.2, 0.00×0.2) × 0.0 = 0.0
- V[Det][2] = max(0.10×0.6, 0.23×0.3, 0.00×0.0) × 0.9 = 0.0621
Time t=3 (flight):
- V[Noun][3] = max(0.0115×0.2, 0.0×0.5, 0.0621×0.8) × 0.8 ≈ 0.0397
- V[Verb][3] = max(0.0115×0.1, 0.0×0.2, 0.0621×0.2) × 0.1 ≈ 0.0012
- V[Det][3] = max(0.0115×0.6, 0.0×0.3, 0.0621×0.0) × 0.0 = 0.0
Optimal Path: Book/Verb → that/Det → flight/Noun, obtained by backtracking from the highest final score V[Noun][3].
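Plugging the example matrices into the viterbi sketch from Section 3, with the assumed uniform initial distribution π ≈ 0.33, reproduces the hand-computed path and score up to rounding:

```python
tags = ["Noun", "Verb", "Det"]
pi = {"Noun": 0.33, "Verb": 0.33, "Det": 0.33}       # assumed uniform start distribution

trans = {                                            # P(tag_j | tag_i), rows = previous tag
    "Noun": {"Noun": 0.2, "Verb": 0.1, "Det": 0.6},
    "Verb": {"Noun": 0.5, "Verb": 0.2, "Det": 0.3},
    "Det":  {"Noun": 0.8, "Verb": 0.2, "Det": 0.0},
}

emit = {                                             # P(word | tag)
    "Noun": {"Book": 0.3, "that": 0.1, "flight": 0.8},
    "Verb": {"Book": 0.7, "that": 0.0, "flight": 0.1},
    "Det":  {"Book": 0.0, "that": 0.9, "flight": 0.0},
}

prob, path = viterbi(["Book", "that", "flight"], tags, pi, trans, emit)
print(path, prob)   # ['Verb', 'Det', 'Noun'], probability ≈ 0.0399
```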
5. Computational Complexity
- Time Complexity: O(N × T²) where N = sentence length, T = number of tags
- Space Complexity: O(N × T) for the Viterbi table
Comparison: Without dynamic programming, finding the optimal path requires scoring all O(T^N) possible tag sequences, making Viterbi essential for practical applications.
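For a rough sense of scale, assume the 45-tag Penn Treebank tagset (T = 45) and a 20-word sentence (N = 20): Viterbi fills about N × T² = 20 × 45² = 40,500 table cells, whereas exhaustive search would have to score 45²⁰ ≈ 1.2 × 10³³ candidate tag sequences.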
6. Key Insights
Optimal Substructure
If the globally best tag sequence passes through tag s at position j, then its first j tags must themselves form the best sequence ending in s at position j. This optimal substructure is what makes dynamic programming applicable.
Markov Property
Current tag depends only on the previous tag, not the entire history: P(tᵢ | t₁...tᵢ₋₁) = P(tᵢ | tᵢ₋₁)
Probability Balance
The algorithm optimally balances:
- Local compatibility: How well words fit their tags (emission probabilities)
- Global coherence: How well tag sequences flow together (transition probabilities)
7. Applications Beyond POS Tagging
- Speech Recognition: Finding most likely word sequences from acoustic signals
- Bioinformatics: Gene sequence analysis and protein structure prediction
- Named Entity Recognition: Identifying person, location, organization mentions
- Machine Translation: Word alignment between source and target languages
- Information Extraction: Structured data extraction from unstructured text
8. Practical Considerations
Smoothing Techniques
Handle unseen word-tag combinations using:
- Add-one (Laplace) smoothing
- Good-Turing estimation
- Back-off models
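As an illustration of the first option, here is a minimal sketch of add-one (Laplace) smoothing applied to emission estimates; the counts, corpus, and vocabulary size below are hypothetical:

```python
from collections import Counter

def laplace_emission(word_tag_counts, tag_counts, vocab_size):
    """Add-one smoothed emission table: P(word | tag) = (count(word, tag) + 1) / (count(tag) + V).

    Unseen word/tag pairs receive a small non-zero probability instead of
    zeroing out every Viterbi path that passes through that tag.
    """
    def prob(word, tag):
        return (word_tag_counts[(word, tag)] + 1) / (tag_counts[tag] + vocab_size)
    return prob

# Hypothetical counts from a tiny tagged corpus
word_tag_counts = Counter({("flight", "Noun"): 3, ("book", "Verb"): 2, ("book", "Noun"): 1})
tag_counts = Counter({"Noun": 4, "Verb": 2})

p = laplace_emission(word_tag_counts, tag_counts, vocab_size=1000)
print(p("flight", "Noun"))   # (3 + 1) / (4 + 1000) ≈ 0.004
print(p("flight", "Verb"))   # unseen pair, still non-zero: 1 / 1002 ≈ 0.001
```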
Unknown Words
Strategies for out-of-vocabulary words:
- Morphological analysis
- Character-level features
- Word embeddings
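As one deliberately simplified illustration of the morphological idea, the sketch below guesses an emission score for an out-of-vocabulary word from its suffix; the suffix-to-tag table and fallback values are hypothetical:

```python
def oov_emission(word, tag):
    """Crude suffix-based emission estimate for unknown words (hypothetical values)."""
    suffix_priors = {
        "ing": {"Verb": 0.7, "Noun": 0.2, "Adj": 0.1},
        "ed":  {"Verb": 0.8, "Adj": 0.2},
        "ly":  {"Adv": 0.9, "Adj": 0.1},
        "s":   {"Noun": 0.6, "Verb": 0.4},
    }
    for suffix, dist in suffix_priors.items():
        if word.lower().endswith(suffix):
            return dist.get(tag, 0.0)
    # Fallback: unknown words with no informative suffix are most often nouns
    return 0.001 if tag == "Noun" else 0.0001

print(oov_emission("rebooking", "Verb"))   # 0.7
```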
The Viterbi algorithm remains a cornerstone of sequence labeling, providing both theoretical elegance and practical efficiency for natural language processing tasks.