[Parsec] A combinator to match between M and N times?

Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec. Much to my surprise, it seems quite difficult to write it myself and, until now, I failed (the best result I had was with the "option" combinator, which unfortunately requires a dummy value, returned when the parser fails). Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)

On Tue, Aug 29, 2006 at 03:05:39PM +0200, Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
Much to my surprise, it seems quite difficult to write it myself and, until now, I failed (the best result I had was with the "option" combinator, which unfortunately requires a dummy value, returned when the parser fails).
How about this? countBetween m n p = do xs <- count m p ys <- count (n - m) $ option Nothing $ do y <- p return (Just y) return (xs ++ catMaybes ys) Assuming n >= m.
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)
No liftM, as requested :-) Best regards Tomasz

On Tue, 29 Aug 2006, Tomasz Zielonka wrote:
How about this?
countBetween m n p = do xs <- count m p ys <- count (n - m) $ option Nothing $ do y <- p return (Just y) return (xs ++ catMaybes ys)
Assuming n >= m.
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)
No liftM, as requested :-)
Although, one is tempted to write liftM Just p instead of do y <- p return (Just y) :-)

On Tue, Aug 29, 2006 at 03:11:09PM +0200,
Tomasz Zielonka
How about this?
It works fine, many thanks. Here is the version rewritten according to my taste: import Text.ParserCombinators.Parsec import Data.Maybe (catMaybes) countBetween m n p | n < m = error "First bound must be lower or equal than second bound" | otherwise = do xs <- count m p ys <- count (n - m) ((option Nothing) (do y <- p return (Just y))) return (xs ++ catMaybes ys)

On Aug 29, 2006, at 9:11 AM, Tomasz Zielonka wrote:
On Tue, Aug 29, 2006 at 03:05:39PM +0200, Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
Much to my surprise, it seems quite difficult to write it myself and, until now, I failed (the best result I had was with the "option" combinator, which unfortunately requires a dummy value, returned when the parser fails).
How about this?
countBetween m n p = do xs <- count m p ys <- count (n - m) $ option Nothing $ do y <- p return (Just y) return (xs ++ catMaybes ys)
Assuming n >= m.
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)
No liftM, as requested :-)
Here's an interesting puzzle. For a moment, consider parsec only wrt its language-recognition capabilities. Then, we expect the count combinator to factor, count x p >> count y p === count (x+y) p where === mean "accepts the same set of strings". I somehow intuitively expect the countBetween combinator to factor in a similar way also, but it doesn't (at least, none of the posted versions do)! Note the output of: parser1 = countBetween 3 7 (char 'a') >> eof parser2 = countBetween 2 3 (char 'a') >> countBetween 1 4 (char 'a')
eof
main = do print $ parse parser1 "" "aaa" print $ parse parser2 "" "aaa" OK. What's happening is that the greedy nature of the combinator breaks things because parsec doesn't do backtracking by default. I'd expect to be able to insert 'try' in the right places to make it work. However, after playing around for a few minutes, I can't figure out any combination that does it. Is it possible to write this combinator so that it factors in this way?
Best regards Tomasz
Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Robert Dockins wrote:
On Aug 29, 2006, at 9:11 AM, Tomasz Zielonka wrote:
On Tue, Aug 29, 2006 at 03:05:39PM +0200, Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
Much to my surprise, it seems quite difficult to write it myself and, until now, I failed (the best result I had was with the "option" combinator, which unfortunately requires a dummy value, returned when the parser fails).
How about this?
countBetween m n p = do xs <- count m p ys <- count (n - m) $ option Nothing $ do y <- p return (Just y) return (xs ++ catMaybes ys)
Assuming n >= m.
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)
No liftM, as requested :-)
Here's an interesting puzzle. For a moment, consider parsec only wrt its language-recognition capabilities.
Then, we expect the count combinator to factor,
count x p >> count y p === count (x+y) p
where === mean "accepts the same set of strings".
I somehow intuitively expect the countBetween combinator to factor in a similar way also, but it doesn't (at least, none of the posted versions do)! Note the output of:
parser1 = countBetween 3 7 (char 'a') >> eof parser2 = countBetween 2 3 (char 'a') >> countBetween 1 4 (char 'a') >> eof
main = do print $ parse parser1 "" "aaa" print $ parse parser2 "" "aaa"
OK. What's happening is that the greedy nature of the combinator breaks things because parsec doesn't do backtracking by default. I'd expect to be able to insert 'try' in the right places to make it work. However, after playing around for a few minutes, I can't figure out any combination that does it. Is it possible to write this combinator so that it factors in this way?
My regex-parsec part of TextRegexLazy implements Greedy,Lazy,and Possessive semantics for regular expressions using Parsec. It is not obvious at first how to insert <|> and 'try'. You have to use a continuation style. The above example could be simply done, however, as: count 2 (char 'a') choice [count 1 (char 'a') >> countBetween 1 4 (char 'a') ,countBetween 1 4 (char 'a') ] This can be automated. A not-maximally efficient version would be: cb m n p cont | m<=n = do xs <- count m p let rep 0 = return xs rep i = do ys <- count i p return (xs++ys) choice [ try (rep i >>= cont) | i <- [(n-m),(n-m)-1 .. 0] ] test = cb 2 3 (string "ab") (\xs -> cb 1 4 (string "ab") (\ys -> return (xs,ys))) go = runParser test () "" "abababac" Where go now returns Right (["ab","ab"],["ab"])

How to use parsec to match optional things. Lets consider regex style matching of greedy a?b, lazy a??b, and possessive a?+b. And consider greedy a*b, lazy a*?b, and possessive a*+b. Then I think these examples (cribbed from my module) will work: -- Building blocks greedyOpt p contFail contSuccess = try (p >> contSuccess) <|> contFail lazyOpt p contFail contSuccess = try contFail <|> (p >> contSuccess) possessiveOpt p contFail contSuccess = ((try p) >> contSuccess) <|> contFail -- Match p*cont p*?cont p*+cont greedyStar p cont = fix (greedyOpt p cont) lazyStar p cont = fix (lazyOpt p cont) possessiveStar p cont = fix (possessiveOpt p cont) -- Match p?cont p??cont p?+cont greedyQuest p cont = greedyOpt p cont cont lazyQuest p cont = lazyOpt p cont cont possessiveQuest p cont = possessiveOpt p cont cont Altering p to return a useful value is left to the user.

Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
Much to my surprise, it seems quite difficult to write it myself and, until now, I failed (the best result I had was with the "option" combinator, which unfortunately requires a dummy value, returned when the parser fails).
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-
The problem with the other solutions posted so far is for (countBetween 0 100 p) they will always try to run p 100 times, regardless of when it starts to fail. This can be made more efficient by stopping when p first fails. Also, you probably have to use "try p" for the (n-m) cases, so if p consumes some input it does not cause the whole operation to fail. import Text.ParserCombinators.Parsec -- This stops when p first fails, using option to hide <|> countBetween m n p | m<=n = do xs <- count m p let loop 0 acc = return (acc []) loop i acc = do -- mx <- option Nothing (liftM Just p) mx <- option Nothing (do x <- try p return (Just x)) case mx of Nothing -> return (acc []) Just x -> loop (pred i) (acc . (x:)) ys <- loop (n-m) id return (xs++ys) -- This also works and uses <|> directly instead of via option countBetween' m n p | m<=n = do xs <- count m try p let loop 0 acc = return (acc []) loop i acc = (do x<-p loop (pred i) (acc . (x:))) <|> (return (acc [])) ys <- loop (n-m) id return (xs++ys)

I put the "try" in the wrong place. Silly me
-- This also works and uses <|> directly instead of via option countBetween' m n p | m<=n = do xs <- count m p let loop 0 acc = return (acc []) loop i acc = (do x <- try p loop (pred i) (acc . (x:))) <|> (return (acc [])) ys <- loop (n-m) id return (xs++ys)

Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
infixr 2 <:> (<:>) = ap . ap (return (:)) countBetween 0 0 _ = return [] countBetween 0 n p = p <:> countBetween 0 (n-1) p <|> return [] countBetween m n p = p <:> countBetween (m-1) (n-1) p (Shortest solution yet, I think. Is primitive recursion somehow out of fashion? Should I rewrite it as two folds?)
Does anyone has a solution? Preferrably one I can understand, which means not yet with liftM :-)
As requested, though I believe a quick 'liftM2' would have been easier than two 'ap's. But if you prefer: a <:> b = do x <- a y <- b return (a:b) or what I'd ordinarily use: (<:>) = liftM2 (:) Udo. -- As Will Rogers would have said, "There is no such thing as a free variable."

Udo Stenzel wrote:
Stephane Bortzmeyer wrote:
Parsec provides "count n p" to run the parser p exactly n times. I'm looking for a combinator "countBetween m n p" which will run the parser between m and n times. It does not exist in Parsec.
infixr 2 <:> (<:>) = ap . ap (return (:))
Again, Parsec requires you to put "try" where you need it
countBetween 0 0 _ = return [] countBetween 0 n p = try p <:> countBetween 0 (n-1) p <|> return [] countBetween m n p = p <:> countBetween (m-1) (n-1) p
or what I'd ordinarily use:
(<:>) = liftM2 (:)

Chris Kuklewicz wrote:
Again, Parsec requires you to put "try" where you need it
I'm pretty sure it does, although this
Udo Stenzel wrote:
countBetween 0 n p = p <:> countBetween 0 (n-1) p <|> return []
is a place where it's not needed in general. You should know what you're doing, though. And I like ReadP better in general, exactly because 'try' is a bit trippy. Udo. -- C'est magnifique, mais ce n'est pas l'Informatique. -- Bosquet [on seeing the IBM 4341]

Udo Stenzel wrote:
Chris Kuklewicz wrote:
Again, Parsec requires you to put "try" where you need it
I'm pretty sure it does, although this
Udo Stenzel wrote:
countBetween 0 n p = p <:> countBetween 0 (n-1) p <|> return []
is a place where it's not needed in general.
No, see my example below. It is not needed in the specific case where p can only consume a single token.
You should know what you're doing, though. And I like ReadP better in general, exactly because 'try' is a bit trippy.
And I have not used ReadP enough to be good with it.
Udo.
As long as you test with 'p' that consumes a single token, there is no difference using try (<:>) = liftM2 (:) countBetween 0 0 _ = return [] countBetween 0 n p = p <:> countBetween 0 (n-1) p <|> return [] countBetween m n p = p <:> countBetween (m-1) (n-1) p Expand countBetween 0 n p = (do value_p <- p value_rest <- countBetween 0 (n-1) p return (value : value_rest) <|> (return []) If p fails without consuming tokens, then the <|> runs (return []). If p fails after consuming tokens, then the <|> will fail instead of running (return []). By replacing p with (try p) this will work.
test = countBetween 3 5 (string "ab") go = runParser test () "" "abababac"
Without try this fails to parse with an error message: Left (line 1, column 7): unexpected "c" expecting "ab" With the try this succeeds: Right ["ab","ab","ab"] I assume you think the former is problem. If you do not then we are trying to define different operations. The problem is that string "ab" consumes the 'a' character/token before it fails. The 'try' wraps this failure, pretending p did not consume any input, so the <|> will then run (return []) as we want. It is not needed in the line countBetween m n p = p <:> countBetween (m-1) (n-1) p because if p fails then the whole operation should fail. There is a benefit to this difficult behavior! The <|> choice point for backtracking is expensive. By committing to the first branch when it consumes a token, the Parsec monad can completely forget the other branch of <|> which allows it to be garbage collected. This default is handy for preventing memory leaks, but it has to be overridden by 'try'. One can package <|> and 'try' in useful combinators such as countBetween so the use of them becomes slightly easier. -- Chris

Udo Stenzel wrote:
Chris Kuklewicz wrote:
Again, Parsec requires you to put "try" where you need it
I'm pretty sure it does, although this
Udo Stenzel wrote:
countBetween 0 n p = p <:> countBetween 0 (n-1) p <|> return []
is a place where it's not needed in general. You should know what you're doing, though. And I like ReadP better in general, exactly because 'try' is a bit trippy.
Udo.
I just tried to mimic regular expression matching with ReadP and got what seems like a non-terminating program. Is there another way to use ReadP to do this?
import Control.Monad import Text.ParserCombinators.ReadP
type R = ReadP Int
-- Consume a specific character, return length 1 c :: Char -> R c x = char x >> return 1
-- Consume like x? x+ x* and return the length quest,plus,star :: R -> R quest x = option 0 x plus x = liftM sum (many1 x) star x = liftM sum (many x)
-- Concatenate two with sum of lengths infixr 5 +> (+>) :: R -> R -> R (+>) x y = liftM2 (+) x y
-- Concatenate list with running total of length match xs = match' xs 0 where match' [] t = return t match' (x:xs) t = do v <- x match' xs $! t+v
-- Simulate "(a?|b+|c*)*d" regular expression test = star (choice [quest (c 'a') ,plus (c 'b') ,star (c 'c')]) +> c 'd'
go foo = readP_to_S test foo
'go' works if I remove the leading 'star' operation from 'test' But 'go' seems to not terminate with the leading 'star' My regex-dfa package has a failure which seems similar, and I have been adding the ability to internally rewrite the pattern to avoid the problem. -- Chris

On 8/30/06, Chris Kuklewicz
-- Simulate "(a?|b+|c*)*d" regular expression
But 'go' seems to not terminate with the leading 'star'
Unless I'm missing something... The part of the pattern inside the parentheses should successfully match at least the empty string at the beginning of the string. Since it's regulated by the second (outer) 'star', it will keep matching as long as it keeps succeeding; since it keeps matching the empty string, it keeps matching forever in the same spot. To solve this problem, your implementation of 'star' could perhaps be changed to answer "no more matches" rather than "infinitely many matches" once the body fails to consume any characters. Hope this helps! --Tom Phoenix

Tom Phoenix wrote:
On 8/30/06, Chris Kuklewicz
wrote: -- Simulate "(a?|b+|c*)*d" regular expression
But 'go' seems to not terminate with the leading 'star'
Unless I'm missing something... The part of the pattern inside the parentheses should successfully match at least the empty string at the beginning of the string. Since it's regulated by the second (outer) 'star', it will keep matching as long as it keeps succeeding; since it keeps matching the empty string, it keeps matching forever in the same spot.
To solve this problem, your implementation of 'star' could perhaps be changed to answer "no more matches" rather than "infinitely many matches" once the body fails to consume any characters.
Hope this helps!
--Tom Phoenix
And that is indeed the solution. But then I wanted $ end-of-line anchors (easy) and ^ begin-of-line anchors (annoying). But it works now: -- | Using ReadP to simulate regular expressions, finding the longest match -- by Chris Kuklewicz, public domain import Control.Monad import Data.Set(Set,member) import Data.Maybe(maybe) import Text.ParserCombinators.ReadP type R = Char -> ReadP (Int,Char) dot :: R dot _ = do x <- get return (1,x) anyOf :: Set Char -> R anyOf s _ = do x <- satisfy (`member` s) return (1,x) noneOf :: Set Char -> R noneOf s _ = do x <- satisfy (not.(`member` s)) return (1,x) c :: Char -> R c x _ = char x >> return (1,x) cs :: String -> R cs [] prev = return (0,prev) cs xs _ = string xs >> return (length xs,last xs) atBOL prev = case prev of '\n' -> return (0,prev) _ -> pfail atEOL prev = do rest <- look case rest of [] -> return (0,prev) ('\n':_) -> return (0,prev) _ -> pfail -- Consume like x? x+ x* and return the length quest,plus,star :: R -> R quest x = x <|> (\prev -> return (0,prev)) plus x = x +> star x star x prev = until0 0 prev where until0 t prev' = do (len,prev'') <- quest x prev' if (0==len) then return (t,prev'') else let tot = t + len in seq tot (until0 tot prev'') upToN :: Int -> R -> R upToN n x = helper n where helper 0 prev t = return (t,prev) helper i prev t = do (len,prev') <- x prev if 0==len then return (t,prev') else helper (pred i) prev' $! t+len ranged 0 Nothing x = star x ranged 0 (Just n) x | n>0 = upToN n x ranged m n x | (m>=0) && maybe True (\n'->n'>=m) n = doSeq (replicate m x) +> (ranged 0 (fmap (subtract m) n) x) | otherwise = (\prev -> return (0,prev)) infixr 6 +> infixr 5 <|> (+>),(<|>) :: R -> R -> R (+>) x y = (\prev -> do (lenX,prev') <- x prev (lenY,prev'') <- y prev' let tot = lenX + lenY seq tot (return (tot,prev'')) ) (<|>) x y = (\prev -> (x prev) +++ (y prev)) orSeq,doSeq :: [R] -> R orSeq [] prev = return (0,prev) orSeq xs prev = foldr1 (<|>) xs $ prev doSeq [] prev = return (0,prev) doSeq xs prev = foldr1 (+>) xs $ prev -- Simulate "(^a|b+|c*|^.)*(d|_rest_)$" regular expression test = star (orSeq [quest (c 'a') ,plus (c 'b') ,star (c 'c') ,atBOL +> dot ]) +> doSeq [c 'd' <|> cs "_rest_",atEOL] go foo = case readP_to_S (gather (test '\n')) foo of [] -> Nothing xs -> Just (last xs)

Chris Kuklewicz wrote:
I just tried to mimic regular expression matching with ReadP and got what seems like a non-terminating program. Is there another way to use ReadP to do this?
-- Simulate "(a?|b+|c*)*d" regular expression test = star (choice [quest (c 'a') ,plus (c 'b') ,star (c 'c')]) +> c 'd'
Indeed, this cannot work. ReadP delivers parses in order of increasing length, and your expression produces infinitely many parses of the empty string, you never get to the interesting matches. I'd say, the best solution is to use an equivalent regex which does not contain something of the form 'many (return x)':
-- Simulate "(a?|b+|c*)*d" regular expression test' = star (choice [(c 'a') ,plus (c 'b') ,plus (c 'c')]) +> c 'd'
Udo. -- f u cn rd ths, u cn gt a gd jb n cmptr prgrmmng.
participants (7)
-
Chris Kuklewicz
-
Henning Thielemann
-
Robert Dockins
-
Stephane Bortzmeyer
-
Tom Phoenix
-
Tomasz Zielonka
-
Udo Stenzel