
Steffen Mazanek wrote:
I have two questions regarding a Cocke, Younger, Kasami parser.
type NT = Char -- Nonterminal type T = Char -- Terminal -- a Chomsky production has either two nonterminals or one terminal on its right-hand side type ChomskyProd = (NT, Either T (NT, NT)) -- a grammar consists of a startsymbol, nonterminal symbols, terminal symbols and productions type Grammar = (NT, [NT], [T], [ChomskyProd])
parse::Grammar->[T]->Bool parse (s, nts, ts, prods) w = s `elem` gs n 1 where n = length w table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]] gs 1 j = [nt|p<-prods,termProd p, let (nt, Left t)=p, w!!(j-1)==t] gs i j = [nt|k<-[1..i-1],p<-prods, not (termProd p), let (nt, Right (a, b))=p, a `elem` table!!(k-1)!!(j-1), b `elem` table!!(i-k-1)!!(j+k-1)]
The sets gs i j contain all nonterminal symbols from which the substring of w starting at index j of length i can be derived.
Please have a look at the last line of the algorithm. In my first attempt I just referred to gs k j and gs (i-k) (j+k) what looks a lot more intuitive. However I noted that this way the sets gs i j are evaluated multiple times. Is there a better and more readable way to prevent multiple evaluation of these sets?
The key point of the dynamic programming algorithm is indeed to memoize the results gs i j for all pairs of i and j. In other words, the insight that yields a fast algorithm is that for solving the subproblems gs i j (of which there are n^2), solution to smaller subproblems of the same form are sufficient. Tabulation is a must. Of course, you can still choose how to represent the table. There's a nice higher order way to do that tabulate :: (Int -> Int -> a) -> (Int -> Int -> a) gs = tabulate gs' where gs' 1 j = ... uses gs x y for some x y ... gs' i j = ... ditto ... The function tabulate takes a function on an N x N grid and well, tabulates it. Actually, it doesn't return the table but already the function that indexes into the table. Your current code uses lists to represent the table, i.e. tabulate f = f' where table = [[f i j | j<-[1..n-i+1]] | i<-[1..n]] f' = \i j -> table !! i !! j Note that this depends on sharing the table over multiple invocations of f'. In other words, tabulate f i j = table !! i !! j where table = ... would defeat it's purpose. Of course, accessing the (i,j)-th element of the table takes O(i+j) time. Another options is to use (boxed, immutable) arrays for the table which offer O(1) access. There's an example for this on the wiki page http://haskell.org/haskellwiki/Edit_distance
The second question regards lazy evaluation. Consider the stupid grammar S->S S S->A A A->a
that generates a^(2n).
The performance of the algorithm drops very fast even for small n, probably because the gs i j are getting very large.
That would be strange. I mean, no gs i j may have more than two elements, namely S or A. The other key point of the CYK algorithm is that the sets gs i j are indeed sets and may only contain as many elements as there are nonterminals. In your case however, nonterminals get chosen multiple times (for each production and, worse, for each k !). Be sure to keep every nonterminal only once. Again, you can choose whether to represent the sets of nonterminals by lists or with Data.Set or whatever. Here's a one (suboptimal) way to do it with lists gs = tabulate gs' where gs' 1 j = [nt | p<-prods, termProd p, let (nt, Left t)=p, w !! (j-1)==t ] gs' i j = [nt | p<-prods, not (termProd p) let (nt, Right (a, b))=p, not $ null [k | k<-[1..i-1], a `elem` gs k j, b `elem` gs (i-k) (j+k)]] The gs i j are still not sets, some nonterminals may appear multiple times. But they have at most length prods elements instead of previously length prods * n elements.
Is there a trick to get lazy evaluation into play here? It is sufficient to find only one occurence of the start symbol in gs n 1.
The fact that the gs i j have to be sets is not related to laziness. (Although the code above exploits that (not $ null [k | ...]) returns True as soon as possible thanks to lazy evaluation). Regards, apfelmus