CYK Algorithm
Introduction
The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm used to determine whether a given string can be generated by a given context-free grammar in Chomsky Normal Form (CNF). The algorithm was independently developed by John Cocke, Daniel Younger, and Tadao Kasami in the 1960s.
Chomsky Normal Form
A context-free grammar is in Chomsky Normal Form if all production rules are of the form:
- A → BC (where A, B, and C are non-terminals)
- A → a (where A is a non-terminal and a is a terminal)
- S → ε (where S is the start symbol and ε is the empty string)
Algorithm Overview
The CYK algorithm works by building a triangular table (CYK table) where:
- The rows represent the length of the substring being considered
- The columns represent the starting position of the substring
- Each cell (i,j) contains the set of non-terminals that can generate the substring starting at position j with length i
Algorithm Steps
Initialization: For each position i in the input string w:
- Fill the first row with non-terminals that can generate the corresponding terminal
Filling the Table: For each cell (i,j):
- Consider all possible ways to split the substring into two parts
- For each split, check if there exists a production rule A → BC where:
- B can generate the first part
- C can generate the second part
- If such a rule exists, add A to the cell (i,j)
Acceptance: The string is accepted if the start symbol S is in the cell (n,1), where n is the length of the input string
Time and Space Complexity
- Time Complexity: O(n³|G|), where:
- n is the length of the input string
- |G| is the size of the grammar
- Space Complexity: O(n²|N|), where:
- n is the length of the input string
- |N| is the number of non-terminals
Applications
The CYK algorithm is used in:
- Natural Language Processing
- Compiler Design
- Syntax Analysis
- Grammar Validation
- String Matching
Example
Consider the grammar:
S → AB | BC
A → BA | a
B → CC | b
C → AB | a
And the input string "baaba"
The CYK table would be constructed as follows:
- First row: Fill with terminals
- Subsequent rows: Fill using the production rules
- Check if S is in the top-right cell
The algorithm will determine whether "baaba" can be generated by the given grammar.