CYK Algorithm

Introduction

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm used to determine whether a given string can be generated by a given context-free grammar in Chomsky Normal Form (CNF). The algorithm was independently developed by John Cocke, Daniel Younger, and Tadao Kasami in the 1960s.

Chomsky Normal Form

A context-free grammar is in Chomsky Normal Form if all production rules are of the form:

  • A → BC (where A, B, and C are non-terminals)
  • A → a (where A is a non-terminal and a is a terminal)
  • S → ε (where S is the start symbol and ε is the empty string)

Algorithm Overview

The CYK algorithm works by building a triangular table (CYK table) where:

  • The rows represent the length of the substring being considered
  • The columns represent the starting position of the substring
  • Each cell (i,j) contains the set of non-terminals that can generate the substring starting at position j with length i

Algorithm Steps

  1. Initialization: For each position i in the input string w:

    • Fill the first row with non-terminals that can generate the corresponding terminal
  2. Filling the Table: For each cell (i,j):

    • Consider all possible ways to split the substring into two parts
    • For each split, check if there exists a production rule A → BC where:
      • B can generate the first part
      • C can generate the second part
    • If such a rule exists, add A to the cell (i,j)
  3. Acceptance: The string is accepted if the start symbol S is in the cell (n,1), where n is the length of the input string

Time and Space Complexity

  • Time Complexity: O(n³|G|), where:
    • n is the length of the input string
    • |G| is the size of the grammar
  • Space Complexity: O(n²|N|), where:
    • n is the length of the input string
    • |N| is the number of non-terminals

Applications

The CYK algorithm is used in:

  • Natural Language Processing
  • Compiler Design
  • Syntax Analysis
  • Grammar Validation
  • String Matching

Example

Consider the grammar:

S → AB | BC
          A → BA | a
          B → CC | b
          C → AB | a
          

And the input string "baaba"

The CYK table would be constructed as follows:

  1. First row: Fill with terminals
  2. Subsequent rows: Fill using the production rules
  3. Check if S is in the top-right cell

The algorithm will determine whether "baaba" can be generated by the given grammar.