I recently started writing my first application at work in Haskell and it deals with a lot of parsing.
Among other things I often have to check for a lot of alternatives for fixed strings (parsing natural
language text and people have a lot of ways to abbreviate the same thing in labels). So far I have been
doing this basically via
choice $ map (try . string) [ "foo", "bar", ... ]
This works fine but has two disadvantages, it isn't very fast, in particular when many of the strings
start with the same prefix and it also is a bit error prone since it breaks when you place a prefix of
another string earlier in the list.
My attempt at a solution was to use the bytestring-trie package for a little utility function that basically
parses one character at a time, checks if the string parsed so far is in the trie and then calls itself recursively
with the trie starting with that string. My attempt at that so far looks like this:
(dependencies bytestring-trie, utf8-string and parsec 3)
import qualified Data.ByteString.UTF8 as UTF8
import qualified Data.Trie as Trie
import Text.Parsec
import Text.Parsec.Text (GenParser)
anyOf :: [String] -> GenParser u String
anyOf l =
try $ anyOf' t ""
where t = Trie.fromList $ zip (map UTF8.fromString l) (repeat ())
anyOf' :: Trie.Trie () -> String -> GenParser u String
anyOf' t s = try $ do
c <- lookAhead $ anyChar
let newS = s ++ [ c ] in
case Trie.submap (UTF8.fromString newS) t of
emptyT | Trie.null emptyT ->
case Trie.member (UTF8.fromString s) t of
True ->
return s
False ->
unexpected $ show newS++", expecting one of "++show l
restT -> do
_ <- anyChar
try $ anyOf' restT newS
A successful example usage would be:
parseTest (do; r1 <- anyOf ["Hello", "Hallo", "Foo", "HallofFame"]; r2 <- string "bla"; return (r1, r2)) "Hallobla"
which results in
("Hallo","bla")
(the extra string parser is there so errors in parsing too much are not hidden). An error would result .e.g. from
parseTest (do; r1 <- anyOf ["Hello", "Hallo", "Foo", "HallofFame"]; r2 <- string "bla"; return (r1, r2)) "Hallofbla"
which prints this:
parse error at (line 1, column 8):unknown parse error
And my question about this is made up of two parts
1. Why doesn't it print my "unexpected" message but instead says unknown parse error
2. Why is the location in the text off (I would expect it to fail at
column 6 (first character beyond the result it could return) or 7 (first
character that makes the string no prefix of any acceptable string)
I
am afraid my knowledge of Parsec internals is a bit too limited, some
Google queries showed no similar problems and no obvious places in the
Parsec source code to check for the answer to the first question in
particular and I suspect the second is closely related to the first.
Thanks for reading through my question and I hope someone knows the answer or at least some clues on where i might find it.
Matthias Hoermann
P.S.: I am hoping this time this works, last time it was rejected because google sends with @googlemail.com instead of @gmail.com for some reason.