CYK-style parsing and laziness

Hello,

I have two questions regarding a Cocke-Younger-Kasami parser. Consider this program:

type NT = Char -- Nonterminal
type T  = Char -- Terminal

-- a Chomsky production has either two nonterminals or one terminal
-- on its right-hand side
type ChomskyProd = (NT, Either T (NT, NT))

-- a grammar consists of a start symbol, nonterminal symbols,
-- terminal symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])

parse :: Grammar -> [T] -> Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
  where
    n = length w
    table = [[gs i j | j <- [1..n-i+1]] | i <- [1..n]]
    gs 1 j = [nt | p <- prods, termProd p,
                   let (nt, Left t) = p, w !! (j-1) == t]
    gs i j = [nt | k <- [1..i-1], p <- prods, not (termProd p),
                   let (nt, Right (a, b)) = p,
                   a `elem` table !! (k-1) !! (j-1),
                   b `elem` table !! (i-k-1) !! (j+k-1)]

The sets gs i j contain all nonterminal symbols from which the substring of w starting at index j of length i can be derived.

Please have a look at the last line of the algorithm. In my first attempt I referred to gs k j and gs (i-k) (j+k) directly, which looks a lot more intuitive. However, I noticed that this way the sets gs i j are evaluated multiple times. Is there a better and more readable way to prevent multiple evaluation of these sets?

The second question regards lazy evaluation. Consider the stupid grammar

S -> S S
S -> A A
A -> a

that generates a^(2n). The performance of the algorithm drops very fast even for small n, probably because the gs i j are getting very large.

Is there a trick to get lazy evaluation into play here? It is sufficient to find only one occurrence of the start symbol in gs n 1.

Best regards,
Steffen
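For reference, here is a minimal sketch of how the program above might be exercised. The termProd helper is used but not defined in the message, so the definition below is an assumption inferred from its use, as is the encoding of the stupid grammar:

-- Assumed helper (not shown in the original message): a production is a
-- terminal production exactly when its right-hand side is a terminal.
termProd :: ChomskyProd -> Bool
termProd (_, Left _) = True
termProd _           = False

-- The "stupid" grammar S -> S S | A A, A -> a, as a Grammar value.
stupid :: Grammar
stupid = ('S', "SA", "a",
          [ ('S', Right ('S', 'S'))
          , ('S', Right ('A', 'A'))
          , ('A', Left 'a') ])

-- parse stupid "aaaa" == True; parse stupid "aaa" == False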

Steffen Mazanek wrote:
I have two questions regarding a Cocke-Younger-Kasami parser.
type NT = Char -- Nonterminal
type T  = Char -- Terminal

-- a Chomsky production has either two nonterminals or one terminal
-- on its right-hand side
type ChomskyProd = (NT, Either T (NT, NT))

-- a grammar consists of a start symbol, nonterminal symbols,
-- terminal symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])
parse :: Grammar -> [T] -> Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
  where
    n = length w
    table = [[gs i j | j <- [1..n-i+1]] | i <- [1..n]]
    gs 1 j = [nt | p <- prods, termProd p,
                   let (nt, Left t) = p, w !! (j-1) == t]
    gs i j = [nt | k <- [1..i-1], p <- prods, not (termProd p),
                   let (nt, Right (a, b)) = p,
                   a `elem` table !! (k-1) !! (j-1),
                   b `elem` table !! (i-k-1) !! (j+k-1)]
The sets gs i j contain all nonterminal symbols from which the substring of w starting at index j of length i can be derived.
Please have a look at the last line of the algorithm. In my first attempt I referred to gs k j and gs (i-k) (j+k) directly, which looks a lot more intuitive. However, I noticed that this way the sets gs i j are evaluated multiple times. Is there a better and more readable way to prevent multiple evaluation of these sets?
The key point of the dynamic programming algorithm is indeed to memoize the results gs i j for all pairs of i and j. In other words, the insight that yields a fast algorithm is that for solving the subproblems gs i j (of which there are n^2), solutions to smaller subproblems of the same form are sufficient. Tabulation is a must.

Of course, you can still choose how to represent the table. There's a nice higher-order way to do that:

tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)

gs = tabulate gs'
  where
  gs' 1 j = ... uses gs x y for some x y ...
  gs' i j = ... ditto ...

The function tabulate takes a function on an N x N grid and, well, tabulates it. Actually, it doesn't return the table but already the function that indexes into the table. Your current code uses lists to represent the table, i.e.

tabulate f = f'
  where
  table = [[f i j | j <- [1..n-i+1]] | i <- [1..n]]
  f' = \i j -> table !! i !! j

Note that this depends on sharing the table over multiple invocations of f'. In other words,

tabulate f i j = table !! i !! j
  where
  table = ...

would defeat its purpose. Of course, accessing the (i,j)-th element of the table takes O(i+j) time. Another option is to use (boxed, immutable) arrays for the table, which offer O(1) access. There's an example of this on the wiki page

http://haskell.org/haskellwiki/Edit_distance
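As a concrete illustration of the array alternative (a sketch, not from the thread): since the grid size is not determined by the type, this version takes the bound n as an extra parameter, and the 1-based square bounds are an assumption matching the list version above.

import Data.Array

-- Tabulate a function over the grid [1..n] x [1..n] in a boxed,
-- immutable array. Entries are filled lazily, so this can tie the
-- memoizing knot just like the list version, but lookups cost O(1)
-- instead of O(i+j).
tabulateArr :: Int -> (Int -> Int -> a) -> (Int -> Int -> a)
tabulateArr n f = \i j -> table ! (i, j)
  where
    table = array ((1, 1), (n, n))
                  [((i, j), f i j) | i <- [1 .. n], j <- [1 .. n]]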
The second question regards lazy evaluation. Consider the stupid grammar

S -> S S
S -> A A
A -> a
that generates a^(2n).
The performance of the algorithm drops very fast even for small n, probably because the gs i j are getting very large.
That would be strange. I mean, no gs i j may have more than two elements, namely S or A. The other key point of the CYK algorithm is that the sets gs i j are indeed sets and may only contain as many elements as there are nonterminals. In your case, however, nonterminals get chosen multiple times (for each production and, worse, for each k!). Be sure to keep every nonterminal only once.

Again, you can choose whether to represent the sets of nonterminals by lists or with Data.Set or whatever. Here's one (suboptimal) way to do it with lists:

gs = tabulate gs'
  where
  gs' 1 j = [nt | p <- prods, termProd p,
                  let (nt, Left t) = p, w !! (j-1) == t]
  gs' i j = [nt | p <- prods, not (termProd p),
                  let (nt, Right (a, b)) = p,
                  not $ null [k | k <- [1..i-1],
                                  a `elem` gs k j,
                                  b `elem` gs (i-k) (j+k)]]

The gs i j are still not sets, as some nonterminals may appear multiple times. But they have at most length prods elements instead of the previous length prods * n elements.
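A small further step (my sketch, not from the thread) is to deduplicate inside the memoized function itself, so the stored rows really are sets; after the rewrite above the raw lists are at most length prods long, so nub's quadratic cost stays harmless:

-- needs: import Data.List (nub)
-- Deduplicate each row before it is stored in the table, so every
-- later `elem` test runs over a genuine set of nonterminals.
gs = tabulate gs'
  where
  gs' i j = nub (gsRaw i j)
  gsRaw 1 j = [nt | p <- prods, termProd p,
                    let (nt, Left t) = p, w !! (j-1) == t]
  gsRaw i j = [nt | p <- prods, not (termProd p),
                    let (nt, Right (a, b)) = p,
                    not $ null [k | k <- [1..i-1],
                                    a `elem` gs k j,
                                    b `elem` gs (i-k) (j+k)]]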
Is there a trick to get lazy evaluation into play here? It is sufficient to find only one occurrence of the start symbol in gs n 1.
The fact that the gs i j have to be sets is not related to laziness. (Although the code above exploits that (not $ null [k | ...]) returns True as soon as possible, thanks to lazy evaluation.)

Regards,
apfelmus

Once again, thank you, apfelmus :-)

The key point of the dynamic programming algorithm is indeed to memoize
the results gs i j for all pairs of i and j. In other words, the insight that yields a fast algorithm is that for solving the subproblems gs i j (of which there are n^2), solutions to smaller subproblems of the same form are sufficient. Tabulation is a must.
I understand this; however, I thought it would be possible to use the automatic collapsing of the term graphs in some way.

Of course, you can still choose how to represent the table. There's a
nice higher-order way to do that:
tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)
gs = tabulate gs'
  where
  gs' 1 j = ... uses gs x y for some x y ...
  gs' i j = ... ditto ...
Thank you for this explanation. Your approach is not very concise either, but it does not pollute the algorithm so much.

That would be strange. I mean, no gs i j may have more than two
elements, namely S or A. The other key point of the CYK algorithm is that the sets gs i j are indeed sets and may only contain as many elements as there are nonterminals.
... You are right, of course. I have tried a nub before the list comprehension, however this is evaluated too late. I should really use sets, but then I would miss the list comprehension syntactic sugar sooo much. Is there something similar for real Data.Set?

Best regards,
Steffen

Steffen Mazanek wrote:
apfelmus wrote:
The key point of the dynamic programming algorithm is indeed to memoize the results gs i j for all pairs of i and j. In other words, the insight that yields a fast algorithm is that for solving the subproblems gs i j (of which there are n^2), solutions to smaller subproblems of the same form are sufficient. Tabulation is a must.
I understand this; however, I thought it would be possible to use the automatic collapsing of the term graphs in some way.
Well, some things have to be left to the programmer :), especially the choice of trading space for time.

Note that there are very systematic and natural ways to derive dynamic programming algorithms in functional languages. In a sense, much of the work of R. Bird centers on this topic. The book "Algebra of Programming"

http://web.comlab.ox.ac.uk/oucl/research/pdt/ap/pubs.html#Bird-deMoor96:Alge...

is one of the cornerstones. The systematic derivation of dynamic programming algorithms has been rediscovered in a more direct but less general fashion:

http://bibiserv.techfak.uni-bielefeld.de/adp/
Of course, you can still choose how to represent the table. There's a
nice higher-order way to do that:
tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)
gs = tabulate gs'
  where
  gs' 1 j = ... uses gs x y for some x y ...
  gs' i j = ... ditto ...
Thank you for this explanation. Your approach is not very concise either, but it does not pollute the algorithm so much.
Oh? It's concise enough for me :) The nice thing about an explicit 'tabulate' is that you can separate the table and the entry calculations completely.
That would be strange. I mean, no gs i j may have more than two
elements, namely S or A. The other key point of the CYK algorithm is that the sets gs i j are indeed sets and may only contain as many elements as there are nonterminals.
.... You are right, of course. I have tried a nub before the list comprehension, however this is evaluated too late.
Yes, the nub has to eliminate storing a nonterminal "for every k". But this can be done in advance by noting that we're only interested in whether there exists at least one k

[nt | ..., not $ null [k | k <- [1..i-1],
                           a `elem` gs k j, b `elem` gs (i-k) (j+k)]]

and don't want to emit a nonterminal for each such k

[nt | ..., k <- [1..i-1], a `elem` gs k j, b `elem` gs (i-k) (j+k)]
I should really use sets, but then I would miss the list comprehension syntactic sugar sooo much. Is there something similar for real Data.Set?
Not that I know of, but you can always use

fromList [nt | ...]

Note that the Data.Set can be thought of as being part of the tabulation. In effect, you really deal with a 3-dimensional truth table

gs i j nt = substring starting at j of length i can be derived
            from nonterminal nt

Here's an implementation of the tabulation with Data.Set:

tabulate :: (Int -> Int -> NT -> Bool) -> (Int -> Int -> NT -> Bool)
tabulate gs' = \i j nt -> nt `member` (table ! (i,j))
  where
  bnds  = ((1,1), (n,n))
  table = array bnds [(ij, mkSet $ gs' i j) | ij@(i,j) <- range bnds]
  mkSet chi = fromList [nt | nt <- nts, chi nt]

And here's the memoized function:

gs = tabulate gs'
  where
  gs' 1 j nt = or [True | Left t <- productions nt, w !! (j-1) == t]
  gs' i j nt = or [True | Right (a,b) <- productions nt, k <- [1..i-1],
                          gs k j a, gs (i-k) (j+k) b]

It assumes a function (dependent on the grammar at hand)

productions :: NT -> [Either T (NT, NT)]

that returns the terminal and nonterminal productions for a given nonterminal.

Regards,
apfelmus
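To make the assumed interface concrete, productions for the stupid grammar could look like this (my sketch, not from the thread):

-- The stupid grammar S -> S S | A A, A -> a, expressed in the
-- productions interface assumed above.
productions :: NT -> [Either T (NT, NT)]
productions 'S' = [Right ('S', 'S'), Right ('A', 'A')]
productions 'A' = [Left 'a']
productions _   = []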

Note that there are very systematic and natural ways to derive dynamic programming algorithms in functional languages. In a sense, much of the work of R. Bird centers on this topic. The book "Algebra of Programming"
http://web.comlab.ox.ac.uk/oucl/research/pdt/ap/pubs.html#Bird-deMoor96:Alge...
is one of the cornerstones.
The systematic derivation of dynamic programming algorithms has been rediscovered in a more direct but less general fashion
Thanks a lot for providing the links! Interesting stuff! And also many thanks for the exemplary implementation. This is really enlightening.

Ciao,
Steffen

On Wednesday, 23 May 2007 at 17:55, Steffen Mazanek wrote:
Hello,
I have two questions regarding a Cocke-Younger-Kasami parser.
Consider this program:
type NT = Char -- Nonterminal
type T  = Char -- Terminal

-- a Chomsky production has either two nonterminals or one terminal
-- on its right-hand side
type ChomskyProd = (NT, Either T (NT, NT))

-- a grammar consists of a start symbol, nonterminal symbols,
-- terminal symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])
parse :: Grammar -> [T] -> Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
  where
    n = length w
    table = [[gs i j | j <- [1..n-i+1]] | i <- [1..n]]
    gs 1 j = [nt | p <- prods, termProd p,
                   let (nt, Left t) = p, w !! (j-1) == t]
    gs i j = [nt | k <- [1..i-1], p <- prods, not (termProd p),
                   let (nt, Right (a, b)) = p,
                   a `elem` table !! (k-1) !! (j-1),
                   b `elem` table !! (i-k-1) !! (j+k-1)]
The sets gs i j contain all nonterminal symbols from which the substring of w starting at index j of length i can be derived.
Please have a look at the last line of the algorithm. In my first attempt I referred to gs k j and gs (i-k) (j+k) directly, which looks a lot more intuitive. However, I noticed that this way the sets gs i j are evaluated multiple times. Is there a better and more readable way to prevent multiple evaluation of these sets?
I'm not sure about readability, but you get once-only evaluation by memoizing them (as you did); using an array is faster, and an array of sets is faster than an array of lists. For inputs of approximately 300 characters, your version takes about 180s here to parse the stupid grammar, my version below about 6.5s, and using an array of lists for memoisation takes about 20s (all still horribly slow, I'm afraid).

For a harder grammar

S -> a | A E | B F
A -> b | A H | S B
B -> b | D I | A B
C -> b
D -> a
E -> C C
F -> C G
G -> D D
H -> A C
I -> B A

an array of lists is hardly better than a list of lists, an array of sets *far* better (the longer the input, the larger the gain, apparently).

setparse :: Grammar -> [T] -> Bool
setparse (s, nts, ts, prods) w = s `member` (table ! (n,1))
  where
    n = length w
    sortFst = sortBy (\(a,_) (b,_) -> compare a b)
    (tps, ntps) = partition termProd prods
    termMap :: Map T (Set NT)
    termMap = Map.fromAscListWith Set.union $
                sortFst [(t, singleton nt) | (nt, Left t) <- tps]
    findTerm :: T -> Set NT
    findTerm = flip (Map.findWithDefault Set.empty) termMap
    table :: Array (Int,Int) (Set NT)
    table = array ((1,1), (n,n)) $
              zip [(1,j) | j <- [1 ..]] (map findTerm w)
              ++ [((i,j), Set.fromList
                    [nt | (nt, Right (a, b)) <- ntps
                        , k <- [1 .. i-1]
                        , a `member` (table ! (k,j))
                        , b `member` (table ! (i-k,j+k))])
                 | i <- [2 .. n], j <- [1 .. n-i+1]]
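For reference, the imports setparse appears to rely on (my reconstruction; the mix of qualified Map and unqualified Set names suggests roughly the following), together with the harder grammar encoded as a Grammar value:

import Data.Array (Array, array, (!))
import Data.List (partition, sortBy)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Set (Set, member, singleton)
import qualified Data.Set as Set

-- The harder grammar above as a Grammar value (an illustrative
-- encoding using the Grammar type from the original post, not part
-- of the message itself).
hard :: Grammar
hard = ('S', "SABCDEFGHI", "ab",
        [ ('S', Left 'a'), ('S', Right ('A','E')), ('S', Right ('B','F'))
        , ('A', Left 'b'), ('A', Right ('A','H')), ('A', Right ('S','B'))
        , ('B', Left 'b'), ('B', Right ('D','I')), ('B', Right ('A','B'))
        , ('C', Left 'b'), ('D', Left 'a')
        , ('E', Right ('C','C')), ('F', Right ('C','G'))
        , ('G', Right ('D','D')), ('H', Right ('A','C'))
        , ('I', Right ('B','A')) ])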
The second question regards lazy evaluation. Consider the stupid grammar

S -> S S
S -> A A
A -> a
that generates a^(2n). The performance of the algorithm drops very fast even for small n, probably because the gs i j are getting very large.
They don't; for the 'stupid' grammar above, all the gs have at most one element. Rather, there are so many to be checked (at least if the parse fails, all possibilities must be considered; for successful parses, it's conceivable that not all need be checked): to determine gs i j, for each (nt)production we must check (i-1) pairs, and for each i there are n-i+1 j's, altogether

n*(n^2-1)/6 * (number of (nt)productions)

checks to do. So by the sheer number it's not surprising that the faster array lookup and Set-membership test give a massive speedup, but the algorithm remains O(n^3) or worse.

HTH,
Daniel
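For scale: with n = 300 and the two nonterminal productions of the stupid grammar, that formula gives 300*(300^2-1)/6*2, roughly nine million pair checks, each involving up to two set-membership tests.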
Is there a trick to get lazy evaluation into play here? It is sufficient to find only one occurrence of the start symbol in gs n 1.
Best regards,
Steffen