[Haskell-cafe] Re: CYK-style parsing and laziness

23 May 2007

      Steffen Mazanek wrote:
...
I have two questions regarding a Cocke, Younger, Kasami parser.
type NT = Char -- Nonterminal
type T = Char  -- Terminal
-- a Chomsky production has either two nonterminals or one terminal on its
right-hand side
type ChomskyProd = (NT, Either T (NT, NT))
-- a grammar consists of a startsymbol, nonterminal symbols, terminal
symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])
parse::Grammar->[T]->Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
where
n = length w
table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
gs 1 j = [nt|p<-prods,termProd p,
             let (nt, Left t)=p, w!!(j-1)==t]
gs i j = [nt|k<-[1..i-1],p<-prods,
             not (termProd p),
             let (nt, Right (a, b))=p,
             a `elem` table!!(k-1)!!(j-1), b `elem`
table!!(i-k-1)!!(j+k-1)]
The sets gs i j contain all nonterminal symbols from which the substring of
w starting at index j of length i can be derived.
Please have a look at the last line of the algorithm. In my first attempt I
just referred to gs k j and gs (i-k) (j+k) what looks a lot more intuitive.
However I noted that this way the sets gs i j are evaluated multiple
times.
Is there a better and more readable way to prevent multiple evaluation of
these sets?
The key point of the dynamic programming algorithm is indeed to memoize
the results gs i j for all pairs of i and j. In other words, the insight
 that yields a fast algorithm is that for solving the subproblems gs i j
(of which there are n^2), solution to smaller subproblems of the same
form are sufficient. Tabulation is a must.

Of course, you can still choose how to represent the table. There's a
nice higher order way to do that

  tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)

  gs = tabulate gs'
     where
     gs' 1 j = ... uses  gs x y  for some  x y ...
     gs' i j = ... ditto ...

The function  tabulate  takes a function on an N x N grid and well,
tabulates it. Actually, it doesn't return the table but already the
function that indexes into the table.

Your current code uses lists to represent the table, i.e.

  tabulate f = f'
     where
     table = [[f i j | j<-[1..n-i+1]] | i<-[1..n]]
     f'    = \i j -> table !! i !! j

Note that this depends on sharing the  table  over multiple invocations
of f'. In other words,

  tabulate f i j = table !! i !! j
     where table = ...

would defeat it's purpose.

Of course, accessing the (i,j)-th element of the  table  takes O(i+j)
time. Another options is to use (boxed, immutable) arrays for the table
which offer O(1) access. There's an example for this on the wiki page

   http://haskell.org/haskellwiki/Edit_distance
...
The second question regards lazy evaluation. Consider the stupid grammar
S->S S
S->A A
A->a
that generates a^(2n).
The performance of the algorithm drops very fast even for small n, probably
because the gs i j are getting very large.
That would be strange. I mean, no gs i j may have more than two
elements, namely S or A. The other key point of the CYK algorithm is
that the sets gs i j are indeed sets and may only contain as many
elements as there are nonterminals.

In your case however, nonterminals get chosen multiple times (for each
production and, worse, for each k !). Be sure to keep every nonterminal
only once. Again, you can choose whether to represent the sets of
nonterminals by lists or with Data.Set or whatever.

Here's a one (suboptimal) way to do it with lists

  gs = tabulate gs'
     where
     gs' 1 j = [nt | p<-prods, termProd p,
                     let (nt, Left t)=p, w !! (j-1)==t ]

     gs' i j = [nt | p<-prods, not (termProd p)
                     let (nt, Right (a, b))=p,
                     not $ null
                        [k | k<-[1..i-1],
                             a `elem` gs k j,
                             b `elem` gs (i-k) (j+k)]]

The gs i j are still not sets, some nonterminals may appear multiple
times. But they have at most  length prods  elements instead of
previously  length prods * n  elements.
...
Is there a trick to get lazy evaluation into play here?
It is sufficient to find only one occurence of the start
symbol in gs n 1.
The fact that the gs i j have to be sets is not related to laziness.
(Although the code above exploits that (not $ null [k | ...]) returns
True as soon as possible thanks to lazy evaluation).

Regards,
apfelmus