[Haskell-cafe] CYK-style parsing and laziness

23 May 2007

      Hello,

I have two questions regarding a Cocke, Younger, Kasami parser.

Consider this program:

type NT = Char -- Nonterminal
type T = Char  -- Terminal
-- a Chomsky production has either two nonterminals or one terminal on its
right-hand side
type ChomskyProd = (NT, Either T (NT, NT))
-- a grammar consists of a startsymbol, nonterminal symbols, terminal
symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])

parse::Grammar->[T]->Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
 where
 n = length w
 table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
 gs 1 j = [nt|p<-prods,termProd p,
              let (nt, Left t)=p, w!!(j-1)==t]
 gs i j = [nt|k<-[1..i-1],p<-prods,
              not (termProd p),
              let (nt, Right (a, b))=p,
              a `elem` table!!(k-1)!!(j-1), b `elem`
table!!(i-k-1)!!(j+k-1)]

The sets gs i j contain all nonterminal symbols from which the substring of
w starting at index
j of length i can be derived.

Please have a look at the last line of the algorithm. In my first attempt I
just referred to
gs k j and gs (i-k) (j+k) what looks a lot more intuitive. However I noted
that this way the
sets gs i j are evaluated multiple times. Is there a better and more
readable way to prevent
multiple evaluation of these sets?

The second question regards lazy evaluation. Consider the stupid grammar
S->S S
S->A A
A->a

that generates a^(2n).
The performance of the algorithm drops very fast even for small n, probably
because the gs i j
are getting very large.
Is there a trick to get lazy evaluation into play here? It is sufficient to
find only one occurence
of the start symbol in gs n 1.

Best regards,

Steffen

[Haskell-cafe] CYK-style parsing and laziness

Steffen Mazanek