POS Tagging - Viterbi Decoding

Part-of-Speech (POS) tagging is a fundamental sequence labeling task in Natural Language Processing that assigns grammatical categories to words in context. The Viterbi algorithm provides an elegant dynamic programming solution to find the most probable sequence of POS tags using Hidden Markov Models.


1. Hidden Markov Models for POS Tagging

A Hidden Markov Model for POS tagging consists of five components (illustrated by the sketch after this list):

  • Hidden States (S): POS tags {Noun, Verb, Adjective, Determiner, ...}
  • Observable Symbols (O): Words in the vocabulary {the, cat, runs, quickly, ...}
  • Transition Probabilities (A): P(tag_j | tag_i) - likelihood of tag sequence
  • Emission Probabilities (B): P(word | tag) - likelihood of word given tag
  • Initial Probabilities (π): P(tag) - probability of starting with a tag
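
As a concrete illustration, these five components can be stored as plain Python dictionaries. The sketch below uses the toy tag set and probabilities from the worked example in Section 4; the uniform π = 1/3 is an assumption consistent with the 0.33 values used there.

    # Toy HMM parameters; values match the worked example in Section 4.
    states = ["Noun", "Verb", "Det"]

    # Initial probabilities pi: P(tag at position 1). Uniform here (assumption).
    pi = {"Noun": 1/3, "Verb": 1/3, "Det": 1/3}

    # Transition probabilities A: A[prev_tag][next_tag] = P(next_tag | prev_tag).
    A = {
        "Noun": {"Noun": 0.2, "Verb": 0.1, "Det": 0.6},
        "Verb": {"Noun": 0.5, "Verb": 0.2, "Det": 0.3},
        "Det":  {"Noun": 0.8, "Verb": 0.2, "Det": 0.0},
    }

    # Emission probabilities B: B[tag][word] = P(word | tag).
    B = {
        "Noun": {"Book": 0.3, "that": 0.1, "flight": 0.8},
        "Verb": {"Book": 0.7, "that": 0.0, "flight": 0.1},
        "Det":  {"Book": 0.0, "that": 0.9, "flight": 0.0},
    }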

2. The Decoding Problem

Given a sequence of words W = w₁, w₂, ..., wₙ and HMM parameters (A, B, π), find the most likely tag sequence T* = t₁, t₂, ..., tₙ such that:

T* = argmax_T P(T | W)

Using Bayes' theorem (P(W) is constant with respect to T), the Markov assumption for tag transitions, and the assumption that each word depends only on its own tag:

T* = argmax_T ∏ᵢ₌₁ⁿ P(wᵢ | tᵢ) × P(tᵢ | tᵢ₋₁)
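
To make this factorization concrete, here is a minimal sketch that scores one candidate tag sequence, assuming the pi/A/B dictionaries from the Section 1 sketch and treating π as supplying the i = 1 factor:

    def sequence_score(words, tags, pi, A, B):
        """Joint probability P(W, T) under the HMM factorization:
        pi(t1) * B(w1|t1) * prod_{i>1} A(ti|t_{i-1}) * B(wi|ti)."""
        score = pi[tags[0]] * B[tags[0]].get(words[0], 0.0)
        for i in range(1, len(words)):
            score *= A[tags[i - 1]][tags[i]] * B[tags[i]].get(words[i], 0.0)
        return score

    # E.g. sequence_score(["Book", "that", "flight"], ["Verb", "Det", "Noun"], pi, A, B)
    # evaluates to 1/3 * 0.7 * 0.3 * 0.9 * 0.8 * 0.8 ≈ 0.040.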


3. Viterbi Algorithm: Dynamic Programming Solution

The Viterbi algorithm solves this optimization problem efficiently using dynamic programming principles.

Mathematical Foundation

For each word position j and tag s, we compute:

V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)

Where:

  • V[s][j]: Maximum probability of any tag sequence ending in state s at position j
  • a_{s',s}: Transition probability from tag s' to tag s
  • b_s(wⱼ): Emission probability of word wⱼ given tag s

Algorithm Steps

1. Initialization (j = 1)
       V[s][1] = π[s] × b_s(w₁)
       Path[s][1] = 0

2. Recursion (j = 2 to N)
       For each state s:
           V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ)
           Path[s][j] = argmax_{s'} (V[s'][j-1] × a_{s',s})

3. Termination
       P* = max_s V[s][N]
       q*_N = argmax_s V[s][N]

4. Backtracking (j = N-1 down to 1)
       q*_j = Path[q*_{j+1}][j+1]
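
A compact Python sketch of these four steps follows. The function name, the dictionary-based parameter layout from the earlier sketch, and the use of .get(..., 0.0) for unseen words are assumptions for illustration, not a reference implementation.

    def viterbi(words, states, pi, A, B):
        """Most probable tag sequence for `words` under an HMM (pi, A, B).
        V[j][s]    : best probability of any tag sequence for w_1..w_j ending in s
        path[j][s] : predecessor tag that achieves V[j][s]"""
        V = [{} for _ in words]
        path = [{} for _ in words]

        # 1. Initialization (j = 1)
        for s in states:
            V[0][s] = pi[s] * B[s].get(words[0], 0.0)
            path[0][s] = None

        # 2. Recursion (j = 2 .. N)
        for j in range(1, len(words)):
            for s in states:
                best_prev = max(states, key=lambda sp: V[j - 1][sp] * A[sp][s])
                V[j][s] = V[j - 1][best_prev] * A[best_prev][s] * B[s].get(words[j], 0.0)
                path[j][s] = best_prev

        # 3. Termination: best probability and best final tag
        best_last = max(states, key=lambda s: V[-1][s])
        best_prob = V[-1][best_last]

        # 4. Backtracking (j = N-1 down to 1)
        tags = [best_last]
        for j in range(len(words) - 1, 0, -1):
            tags.append(path[j][tags[-1]])
        tags.reverse()
        return tags, best_prob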

4. Example Walkthrough

Consider tagging "Book that flight" with tags {Noun, Verb, Det}:

Probability Matrices

Emission Matrix P(word|tag):

             Book   that   flight
    Noun     0.3    0.1    0.8
    Verb     0.7    0.0    0.1
    Det      0.0    0.9    0.0

Transition Matrix P(tag_j | tag_i), where rows index the previous tag tag_i and columns the next tag tag_j:

             Noun   Verb   Det
    Noun     0.2    0.1    0.6
    Verb     0.5    0.2    0.3
    Det      0.8    0.2    0.0

Viterbi Table Computation

Time t=1 (Book), assuming uniform initial probabilities π[s] = 1/3 ≈ 0.33:

  • V[Noun][1] = 0.33 × 0.3 = 0.10
  • V[Verb][1] = 0.33 × 0.7 = 0.23
  • V[Det][1] = 0.33 × 0.0 = 0.00

Time t=2 (that):

  • V[Noun][2] = max(0.10×0.2, 0.23×0.5, 0.00×0.8) × 0.1 = 0.0115
  • V[Verb][2] = max(0.10×0.1, 0.23×0.2, 0.00×0.2) × 0.0 = 0.0
  • V[Det][2] = max(0.10×0.6, 0.23×0.3, 0.00×0.0) × 0.9 = 0.0621

Time t=3 (flight):

  • V[Noun][3] = max(0.0115×0.2, 0.0×0.5, 0.0621×0.8) × 0.8 = 0.0397
  • V[Verb][3] = max(0.0115×0.1, 0.0×0.2, 0.0621×0.2) × 0.1 = 0.0012
  • V[Det][3] = max(0.0115×0.6, 0.0×0.3, 0.0621×0.0) × 0.0 = 0.0

Optimal Path (backtracking from the best final cell, V[Noun][3]): Verb → Det → Noun, i.e. Book/Verb that/Det flight/Noun
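
Running the viterbi sketch from Section 3 on the toy matrices reproduces this path. With exact π = 1/3 the final probability is ≈ 0.0403, slightly above the 0.0397 in the table, which rounds π to 0.33:

    tags, prob = viterbi(["Book", "that", "flight"], states, pi, A, B)
    print(tags)              # ['Verb', 'Det', 'Noun']
    print(round(prob, 4))    # 0.0403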


5. Computational Complexity

  • Time Complexity: O(N × T²) where N = sentence length, T = number of tags
  • Space Complexity: O(N × T) for the Viterbi table

Comparison: Without dynamic programming, finding the optimal path requires scoring all T^N candidate tag sequences, i.e. O(T^N) time, which makes Viterbi essential for practical applications.
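
For intuition, the exponential alternative can be written in a few lines by enumerating every tag assignment with itertools.product and reusing the sequence_score sketch from Section 2; it gives the same answer on short sentences but becomes infeasible as N grows:

    from itertools import product

    def brute_force_decode(words, states, pi, A, B):
        """Score all T**N tag sequences: exponential, versus Viterbi's O(N * T^2)."""
        return max(product(states, repeat=len(words)),
                   key=lambda tags: sequence_score(words, list(tags), pi, A, B))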


6. Key Insights

Optimal Substructure

The best tag sequence ending in tag s at position j must contain, as a prefix, the best sequence ending in s's predecessor at position j-1; this optimal substructure is what makes dynamic programming applicable.

Markov Property

Current tag depends only on the previous tag, not the entire history: P(tᵢ | t₁...tᵢ₋₁) = P(tᵢ | tᵢ₋₁)

Probability Balance

The algorithm optimally balances:

  • Local compatibility: How well words fit their tags (emission probabilities)
  • Global coherence: How well tag sequences flow together (transition probabilities)

7. Applications Beyond POS Tagging

  • Speech Recognition: Finding most likely word sequences from acoustic signals
  • Bioinformatics: Gene sequence analysis and protein structure prediction
  • Named Entity Recognition: Identifying person, location, organization mentions
  • Machine Translation: Word alignment between source and target languages
  • Information Extraction: Structured data extraction from unstructured text

8. Practical Considerations

Smoothing Techniques

Handle unseen word-tag combinations, whose zero probabilities would otherwise zero out entire paths, using (see the sketch after this list):

  • Add-one (Laplace) smoothing
  • Good-Turing estimation
  • Back-off models
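
As a minimal sketch of the first option, add-one smoothing of the emission table estimated from a tagged corpus; the function name and the (word, tag) pair format are illustrative assumptions:

    from collections import Counter

    def smoothed_emissions(tagged_corpus, vocab_size):
        """Add-one (Laplace) smoothed P(word | tag) from (word, tag) pairs.
        Unseen pairs get 1 / (count(tag) + vocab_size) instead of zero,
        so the Viterbi product never collapses to 0."""
        pair_counts = Counter(tagged_corpus)
        tag_counts = Counter(tag for _, tag in tagged_corpus)

        def b(tag, word):
            return (pair_counts[(word, tag)] + 1) / (tag_counts[tag] + vocab_size)

        return b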

Unknown Words

Strategies for out-of-vocabulary words (a simple suffix-based sketch follows this list):

  • Morphological analysis
  • Character-level features
  • Word embeddings
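
A rough sketch of the first strategy: a suffix-based fallback that maps common English endings to a tag distribution. The suffixes, tags, and probabilities below are illustrative guesses, not values estimated from data:

    def unknown_word_emissions(word):
        """Rough morphological fallback for out-of-vocabulary words.
        Numbers are illustrative, not estimated from a corpus."""
        suffix_rules = [
            ("ing", {"Verb": 0.7, "Noun": 0.3}),
            ("ed",  {"Verb": 0.8, "Adjective": 0.2}),
            ("ly",  {"Adverb": 0.9, "Adjective": 0.1}),
            ("s",   {"Noun": 0.6, "Verb": 0.4}),
        ]
        for suffix, dist in suffix_rules:
            if word.lower().endswith(suffix):
                return dist
        return {"Noun": 1.0}    # default guess: unknown words are most often nouns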

The Viterbi algorithm remains a cornerstone of sequence labeling, providing both theoretical elegance and practical efficiency for natural language processing tasks.