Character escape codes in Parsec

Hi all,

I'm using the version of Text.ParserCombinators.Parsec.Token that ships with GHC 6.8.2. I'm trying to use the stringLiteral parser that makeTokenParser uses by default to parse strings that may contain hex ASCII escape codes of the form:

\x [hex-digit] [hex-digit]

For example, "\x0a" should parse to "\n". On the other hand, "\x0aE" should parse to "\nE". I don't get this behavior with Parsec, though: "\x0aE" parses to "\174". I can tell from looking at the source that a hex escape code is defined to be "\x" followed by an arbitrary-length sequence of hex digits. I think this is wrong, because in ASCII, such an escape sequence has exactly two digits.

I can't exactly tell from looking at the source, but is the default token parser in Parsec supposed to be parsing ASCII strings? If so, isn't this a bug in Parsec? If not, and it's meant to be able to handle Unicode or something, I think the documentation should be clearer.

I'd much welcome any enlightenment.

Thanks,
Tim

--
Tim Chevalier * http://cs.pdx.edu/~tjc * Often in error, never in doubt
"and the things I'm working on are invisible to everyone"--Meg Hutchinson

On Sat, 29 Mar 2008, Tim Chevalier wrote:
I can't exactly tell from looking at the source, but is the default token parser in Parsec supposed to be parsing ASCII strings? If so, isn't this a bug in Parsec? If not, and it's meant to be able to handle Unicode or something, I think the documentation should be clearer.
My guess is it's not particularly "supposed" to be doing much, but it munches Char and so Unicode input is a distinct possibility. I'd also guess that after the \ it just reuses the existing hex literal function, which definitely does need to handle arbitrary length.

My approach is simple: when in doubt, write your own lexing functions - at least that way any nasty surprises are your own fault!

--
flippa@flippac.org
Performance anxiety leads to premature optimisation
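As a rough illustration of the "write your own lexing functions" suggestion, here is a minimal sketch of a string-literal parser in which \x is followed by exactly two hex digits. The names asciiString and decode are hypothetical, not part of Parsec, and the sketch handles only the \x escape (no \n, \", and so on), so it is a starting point rather than a drop-in replacement for stringLiteral.

```haskell
import Data.Char (chr, digitToInt)
import Text.ParserCombinators.Parsec

-- Hypothetical helper (not part of Parsec): a double-quoted string in which
-- the only escape is \x followed by exactly two hex digits.
asciiString :: Parser String
asciiString = between (char '"') (char '"') (many strChar)
  where
    -- An ordinary character is anything except a quote or a backslash.
    strChar = hexEscape <|> noneOf "\"\\"
    hexEscape = do
      _  <- string "\\x"
      hi <- hexDigit
      lo <- hexDigit
      return (chr (16 * digitToInt hi + digitToInt lo))

-- Hypothetical wrapper that discards the error message for brevity.
decode :: String -> Maybe String
decode = either (const Nothing) Just . parse asciiString ""

main :: IO ()
main = do
  print (decode "\"\\x0a\"")   -- Just "\n"
  print (decode "\"\\x0aE\"")  -- Just "\nE"
```

Because the escape consumes a fixed two digits, the trailing E in "\x0aE" is left to be read as an ordinary character, which is the behavior asked for in the original message.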

On 3/30/08, Philippa Cowderoy wrote:
My guess is it's not particularly "supposed" to be doing much, but it munches Char and so Unicode input is a distinct possibility. I'd also guess that after the \ it just reuses the existing hex literal function, which definitely does need to handle arbitrary length.
Well, yes, but I'd like to know what the spec is. If there is one, anyway!
My approach is simple: when in doubt, write your own lexing functions - at least that way any nasty surprises are your own fault!
I'll do so if it comes to that, but if it's truly a bug, I'd prefer it to be fixed so that everyone can enjoy the results (even if it's me who ends up fixing it :-)

Cheers,
Tim

--
Tim Chevalier * http://cs.pdx.edu/~tjc * Often in error, never in doubt
"Accordingly, computer scientists commonly choose models which have bottoms, but prefer them topless." -- Davey & Priestley, _Introduction to Lattices and Order_

On Sun, 30 Mar 2008, Tim Chevalier wrote:
On 3/30/08, Philippa Cowderoy wrote:
My guess is it's not particularly "supposed" to be doing much, but it munches Char and so Unicode input is a distinct possibility. I'd also guess that after the \ it just reuses the existing hex literal function, which definitely does need to handle arbitrary length.
Well, yes, but I'd like to know what the spec is. If there is one, anyway!
Having just poked at the manual (http://legacy.cs.uu.nl/daan/download/parsec/parsec.html if you've not found it already), the spec for anything not specifiable in a LanguageDef is "per the Haskell lexing rules". It implements that correctly.

--
flippa@flippac.org
The task of the academic is not to scale great intellectual mountains, but to flatten them.
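The "per the Haskell lexing rules" point can be checked with a small sketch that runs the same literal through Parsec's stringLiteral and through GHC's own read. The names lexer, viaParsec and viaRead are made up for this example, and it assumes a parsec release that still ships the Text.ParserCombinators.Parsec compatibility modules. It also tries the Haskell empty escape \&, which the Haskell report provides for ending a numeric escape early; as far as I can tell Parsec's token parser accepts it too.

```haskell
import Text.ParserCombinators.Parsec (parse)
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as P

-- Default token parser built from the empty language definition.
lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- Hypothetical wrapper: parse a quoted literal with Parsec's stringLiteral.
viaParsec :: String -> Maybe String
viaParsec = either (const Nothing) Just . parse (P.stringLiteral lexer) ""

-- Hypothetical wrapper: parse the same literal with GHC's Read instance.
viaRead :: String -> String
viaRead = read

main :: IO ()
main = do
  let greedy = "\"\\x0aE\""     -- the literal "\x0aE"
      split  = "\"\\x0a\\&E\""  -- the literal "\x0a\&E"
  -- Both lexers take all three hex digits: 0x0AE is 174.
  print (viaParsec greedy, viaRead greedy)
  -- The empty escape \& ends the hex escape after two digits.
  print (viaParsec split, viaRead split)
```

If the two columns agree, Parsec's arbitrary-length hex escape matches the Haskell lexer rather than being a Parsec-specific quirk, and writing "\x0a\&E" is one way to get "\nE" without a custom lexer.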