POS Tagging - Viterbi Decoding

Part-of-Speech (POS) tagging is a fundamental sequence labeling task in Natural Language Processing that assigns grammatical categories to words in context. The Viterbi algorithm provides an elegant dynamic programming solution to find the most probable sequence of POS tags using Hidden Markov Models.


1. Hidden Markov Models for POS Tagging

A Hidden Markov Model for POS tagging consists of five components (illustrated by the sketch after this list):

  • Hidden States (S): POS tags {Noun, Verb, Adjective, Determiner, ...}
  • Observable Symbols (O): Words in the vocabulary {the, cat, runs, quickly, ...}
  • Transition Probabilities (A): P(tag_j | tag_i) - likelihood of tag sequence
  • Emission Probabilities (B): P(word | tag) - likelihood of word given tag
  • Initial Probabilities (π): P(tag) - probability of starting with a tag
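
As a concrete illustration, these five components can be stored as plain Python dictionaries. The sketch below uses the toy tag set and probabilities from the worked example in Section 4; the uniform π = 1/3 is an assumption consistent with the 0.33 values used there.

    # Toy HMM parameters; values match the worked example in Section 4.
    states = ["Noun", "Verb", "Det"]

    # Initial probabilities pi: P(tag at position 1). Uniform here (assumption).
    pi = {"Noun": 1/3, "Verb": 1/3, "Det": 1/3}

    # Transition probabilities A: A[prev_tag][next_tag] = P(next_tag | prev_tag).
    A = {
        "Noun": {"Noun": 0.2, "Verb": 0.1, "Det": 0.6},
        "Verb": {"Noun": 0.5, "Verb": 0.2, "Det": 0.3},
        "Det":  {"Noun": 0.8, "Verb": 0.2, "Det": 0.0},
    }

    # Emission probabilities B: B[tag][word] = P(word | tag).
    B = {
        "Noun": {"Book": 0.3, "that": 0.1, "flight": 0.8},
        "Verb": {"Book": 0.7, "that": 0.0, "flight": 0.1},
        "Det":  {"Book": 0.0, "that": 0.9, "flight": 0.0},
    }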

2. The Decoding Problem

Given a sequence of words W = w₁, w₂, ..., wₙ and HMM parameters (A, B, π), find the most likely tag sequence T* = t₁, t₂, ..., tₙ such that:

T* = argmax_T P(T | W)

Using Bayes' theorem (P(W) is constant with respect to T), the Markov assumption for tag transitions, and the assumption that each word depends only on its own tag:

T* = argmax_T ∏ᵢ₌₁ⁿ P(wᵢ | tᵢ) × P(tᵢ | tᵢ₋₁)
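
To make this factorization concrete, here is a minimal sketch that scores one candidate tag sequence, assuming the pi/A/B dictionaries from the Section 1 sketch and treating π as supplying the i = 1 factor:

    def sequence_score(words, tags, pi, A, B):
        """Joint probability P(W, T) under the HMM factorization:
        pi(t1) * B(w1|t1) * prod_{i>1} A(ti|t_{i-1}) * B(wi|ti)."""
        score = pi[tags[0]] * B[tags[0]].get(words[0], 0.0)
        for i in range(1, len(words)):
            score *= A[tags[i - 1]][tags[i]] * B[tags[i]].get(words[i], 0.0)
        return score

    # E.g. sequence_score(["Book", "that", "flight"], ["Verb", "Det", "Noun"], pi, A, B)
    # evaluates to 1/3 * 0.7 * 0.3 * 0.9 * 0.8 * 0.8 ≈ 0.040.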


3. Viterbi Algorithm: Dynamic Programming Solution

The Viterbi algorithm solves this optimization problem efficiently using dynamic programming principles.

Mathematical Foundation

For each word position j and tag s, we compute:

V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)

Where:

  • V[s][j]: Maximum probability of any tag sequence ending in state s at position j
  • a_{s',s}: Transition probability from tag s' to tag s
  • b_s(wⱼ): Emission probability of word wⱼ given tag s

Algorithm Steps

1. Initialization (j = 1)
       V[s][1] = π[s] × b_s(w₁)
       Path[s][1] = 0

2. Recursion (j = 2 to N)
       For each state s:
           V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)
           Path[s][j] = argmax_{s'} (V[s'][j-1] × a_{s',s})

3. Termination
       P* = max_s V[s][N]
       q*_N = argmax_s V[s][N]

4. Backtracking (j = N-1 down to 1)
       q*_j = Path[q*_{j+1}][j+1]
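
A compact Python sketch of these four steps follows. The function name, the dictionary-based parameter layout from the earlier sketch, and the use of .get(..., 0.0) for unseen words are assumptions for illustration, not a reference implementation.

    def viterbi(words, states, pi, A, B):
        """Most probable tag sequence for `words` under an HMM (pi, A, B).
        V[j][s]    : best probability of any tag sequence for w_1..w_j ending in s
        path[j][s] : predecessor tag that achieves V[j][s]"""
        V = [{} for _ in words]
        path = [{} for _ in words]

        # 1. Initialization (j = 1)
        for s in states:
            V[0][s] = pi[s] * B[s].get(words[0], 0.0)
            path[0][s] = None

        # 2. Recursion (j = 2 .. N)
        for j in range(1, len(words)):
            for s in states:
                best_prev = max(states, key=lambda sp: V[j - 1][sp] * A[sp][s])
                V[j][s] = V[j - 1][best_prev] * A[best_prev][s] * B[s].get(words[j], 0.0)
                path[j][s] = best_prev

        # 3. Termination: best probability and best final tag
        best_last = max(states, key=lambda s: V[-1][s])
        best_prob = V[-1][best_last]

        # 4. Backtracking (j = N-1 down to 1)
        tags = [best_last]
        for j in range(len(words) - 1, 0, -1):
            tags.append(path[j][tags[-1]])
        tags.reverse()
        return tags, best_prob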

4. Example Walkthrough

Consider tagging "Book that flight" with tags {Noun, Verb, Det}:

Probability Matrices

Emission Matrix P(word|tag):

             Book   that   flight
    Noun     0.3    0.1    0.8
    Verb     0.7    0.0    0.1
    Det      0.0    0.9    0.0

Transition Matrix P(tag_j | tag_i), where rows index the previous tag tag_i and columns the next tag tag_j:

             Noun   Verb   Det
    Noun     0.2    0.1    0.6
    Verb     0.5    0.2    0.3
    Det      0.8    0.2    0.0

Viterbi Table Computation

Time t=1 (Book), assuming uniform initial probabilities π[s] = 1/3 ≈ 0.33:

  • V[Noun][1] = 0.33 × 0.3 = 0.10
  • V[Verb][1] = 0.33 × 0.7 = 0.23
  • V[Det][1] = 0.33 × 0.0 = 0.00

Time t=2 (that):

  • V[Noun][2] = max(0.10×0.2, 0.23×0.5, 0.00×0.8) × 0.1 = 0.0115
  • V[Verb][2] = max(0.10×0.1, 0.23×0.2, 0.00×0.2) × 0.0 = 0.0
  • V[Det][2] = max(0.10×0.6, 0.23×0.3, 0.00×0.0) × 0.9 = 0.0621

Time t=3 (flight):

  • V[Noun][3] = max(0.0115×0.2, 0.0×0.5, 0.0621×0.8) × 0.8 = 0.0397
  • V[Verb][3] = max(0.0115×0.1, 0.0×0.2, 0.0621×0.2) × 0.1 = 0.0012
  • V[Det][3] = max(0.0115×0.6, 0.0×0.3, 0.0621×0.0) × 0.0 = 0.0

Optimal Path (backtracking from the best final cell, V[Noun][3]): Verb → Det → Noun, i.e. Book/Verb that/Det flight/Noun
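
Running the viterbi sketch from Section 3 on the toy matrices reproduces this path. With exact π = 1/3 the final probability is ≈ 0.0403, slightly above the 0.0397 in the table, which rounds π to 0.33:

    tags, prob = viterbi(["Book", "that", "flight"], states, pi, A, B)
    print(tags)              # ['Verb', 'Det', 'Noun']
    print(round(prob, 4))    # 0.0403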


5. Computational Complexity

  • Time Complexity: O(N × T²) where N = sentence length, T = number of tags
  • Space Complexity: O(N × T) for the Viterbi table

Comparison: Without dynamic programming, finding the optimal path requires scoring all T^N candidate tag sequences, i.e. O(T^N) time, which makes Viterbi essential for practical applications.
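
For intuition, the exponential alternative can be written in a few lines by enumerating every tag assignment with itertools.product and reusing the sequence_score sketch from Section 2; it gives the same answer on short sentences but becomes infeasible as N grows:

    from itertools import product

    def brute_force_decode(words, states, pi, A, B):
        """Score all T**N tag sequences: exponential, versus Viterbi's O(N * T^2)."""
        return max(product(states, repeat=len(words)),
                   key=lambda tags: sequence_score(words, list(tags), pi, A, B))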


6. Key Insights

Optimal Substructure

The best tag sequence ending in tag s at position j must contain, as a prefix, the best sequence ending in s's predecessor at position j-1; this optimal substructure is what makes dynamic programming applicable.

Markov Property

Current tag depends only on the previous tag, not the entire history: P(tᵢ | t₁...tᵢ₋₁) = P(tᵢ | tᵢ₋₁)

Probability Balance

The algorithm optimally balances:

  • Local compatibility: How well words fit their tags (emission probabilities)
  • Global coherence: How well tag sequences flow together (transition probabilities)

7. Applications Beyond POS Tagging

  • Speech Recognition: Finding most likely word sequences from acoustic signals
  • Bioinformatics: Gene sequence analysis and protein structure prediction
  • Named Entity Recognition: Identifying person, location, organization mentions
  • Machine Translation: Word alignment between source and target languages
  • Information Extraction: Structured data extraction from unstructured text

8. Practical Considerations

Smoothing Techniques

Handle unseen word-tag combinations, whose zero probabilities would otherwise zero out entire paths, using (see the sketch after this list):

  • Add-one (Laplace) smoothing
  • Good-Turing estimation
  • Back-off models
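
As a minimal sketch of the first option, add-one smoothing of the emission table estimated from a tagged corpus; the function name and the (word, tag) pair format are illustrative assumptions:

    from collections import Counter

    def smoothed_emissions(tagged_corpus, vocab_size):
        """Add-one (Laplace) smoothed P(word | tag) from (word, tag) pairs.
        Unseen pairs get 1 / (count(tag) + vocab_size) instead of zero,
        so the Viterbi product never collapses to 0."""
        pair_counts = Counter(tagged_corpus)
        tag_counts = Counter(tag for _, tag in tagged_corpus)

        def b(tag, word):
            return (pair_counts[(word, tag)] + 1) / (tag_counts[tag] + vocab_size)

        return b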

Unknown Words

Strategies for out-of-vocabulary words (a simple suffix-based sketch follows this list):

  • Morphological analysis
  • Character-level features
  • Word embeddings
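
A rough sketch of the first strategy: a suffix-based fallback that maps common English endings to a tag distribution. The suffixes, tags, and probabilities below are illustrative guesses, not values estimated from data:

    def unknown_word_emissions(word):
        """Rough morphological fallback for out-of-vocabulary words.
        Numbers are illustrative, not estimated from a corpus."""
        suffix_rules = [
            ("ing", {"Verb": 0.7, "Noun": 0.3}),
            ("ed",  {"Verb": 0.8, "Adjective": 0.2}),
            ("ly",  {"Adverb": 0.9, "Adjective": 0.1}),
            ("s",   {"Noun": 0.6, "Verb": 0.4}),
        ]
        for suffix, dist in suffix_rules:
            if word.lower().endswith(suffix):
                return dist
        return {"Noun": 1.0}    # default guess: unknown words are most often nouns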

The Viterbi algorithm remains a cornerstone of sequence labeling, providing both theoretical elegance and practical efficiency for natural language processing tasks.