I've recently been playing with Alex + Happy to try parsing Java (for now I'm only working on the lexer / Alex side, so whenever you see "parse" below, it means tokenize). My reference is the Java SE 15 Spec; the chapter of interest is
Chapter 3. Lexical Structure (side note: my timing is a bit interesting, as I just realized while writing this that the Java SE 16 spec came out just a few days ago, so I might switch to that). Now that I've gotten my hands dirty a bit, I have a few questions and hope someone can shed some light on them:
- For now I'm using the "monad-bytestring" wrapper for performance, but I'm starting to think a String-based wrapper is more appropriate, since it would let me follow steps 1 and 2 of "3.2. Lexical Translations" properly before passing the input to Alex - namely, I can pre-process the input stream to (1) translate Unicode escapes, turning the raw byte stream into a stream of Chars, and (2) normalize line terminators into just \n (there's a rough sketch of these two passes after the next two questions). But:
- Are those two passes (Unicode escape translation and line terminator normalization) possible within the Alex framework?
- Is there any way I can stick with a memory-compact string representation? (I'm not sure why Alex doesn't provide a Text-based wrapper, given that its docs mention it internally works on UTF-8 encoded byte sequences.) I could probably consider not using any wrapper at all, as GHC and Agda do, but that's somewhat undocumented territory, so I'm hesitant to go there.
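Here's roughly what I have in mind for the two pre-lexing passes, done on a plain String before Alex ever sees the input. This is only a sketch (no position tracking, and malformed \u escapes are passed through instead of rejected), but it follows the JLS 3.3 rule that a backslash only starts a Unicode escape when it's preceded by an even number of backslashes:

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Pass 1: translate Unicode escapes (\uXXXX, with one or more u's).
-- Per JLS 3.3, a backslash is only eligible to start an escape when it
-- is preceded by an even number of backslashes, hence the "\\ \\" case.
unicodeEscapes :: String -> String
unicodeEscapes ('\\' : '\\' : rest) = '\\' : '\\' : unicodeEscapes rest
unicodeEscapes ('\\' : 'u' : rest) =
  case span (== 'u') rest of                  -- the spec allows \uuu...XXXX
    (_, h1 : h2 : h3 : h4 : rest')
      | all isHexDigit [h1, h2, h3, h4] ->
          chr (foldl (\acc d -> acc * 16 + digitToInt d) 0 [h1, h2, h3, h4])
            : unicodeEscapes rest'
    _ -> '\\' : 'u' : unicodeEscapes rest     -- malformed; real code should error
unicodeEscapes (c : rest) = c : unicodeEscapes rest
unicodeEscapes [] = []

-- Pass 2: normalize CR and CRLF line terminators to a single LF.
normalizeLineTerminators :: String -> String
normalizeLineTerminators ('\r' : '\n' : rest) = '\n' : normalizeLineTerminators rest
normalizeLineTerminators ('\r' : rest)        = '\n' : normalizeLineTerminators rest
normalizeLineTerminators (c : rest)           = c : normalizeLineTerminators rest
normalizeLineTerminators []                   = []

-- what would actually be handed to Alex
preLex :: String -> String
preLex = normalizeLineTerminators . unicodeEscapes
```

If I end up wanting Text, I suppose the same two passes could be written with Data.Text.uncons, but then I'm back to the wrapper question above.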
- The other trouble I have with "3.2. Lexical Translations" is the special rule for ">": "... There is one exception: if lexical translation occurs in a type context (§4.11) ..." - but how in the world am I supposed to do this? Lexical analysis isn't even finished, so how would I tell whether I'm in a type context (and §4.11 is long enough that I won't even try to read it unless forced)? Maybe I could just tokenize every ">" as an individual operator, as if ">>", ">>=", ">>>", and ">>>=" didn't exist, and worry about it later, but that doesn't sound right to me.
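If I did go the "one token per '>'" route, I imagine the shift operators would have to be reassembled later from adjacent '>' tokens using their source positions, so that "List<List<String>>" still works while "a > > b" is rejected. A rough sketch of what I mean (Tok, Located, and adjacentGT are made-up names; the positions would come from AlexPosn in practice):

```haskell
-- token type, heavily abbreviated
data Tok = TGreater | TIdent String {- ... many more ... -}
  deriving (Eq, Show)

-- a token together with the line and column it starts at
data Located a = L { locLine :: !Int, locCol :: !Int, unLoc :: a }

-- two '>' tokens can be merged into '>>' (by a grammar rule or a
-- post-lexing pass) only when they are directly adjacent in the source
adjacentGT :: Located Tok -> Located Tok -> Bool
adjacentGT a b =
  unLoc a == TGreater && unLoc b == TGreater
    && locLine a == locLine b
    && locCol b == locCol a + 1
```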
- I've realized there's a need for "irrecoverable failure": I have a test case with the octal literal "012389", which is supposed to fail, but Alex happily tokenizes it into [octal "0123", decimal "89"]. For now my workaround is that every numeric-literal action checks whether the previous char is a digit and fails if so, but this feels like duct tape over a flawed approach, and at some point it will break on an edge case. An ideal fix IMO would be some notion of irrecoverable failure: failing to parse a literal in full should be irrecoverable, rather than parsing most of it and moving on. In addition, the Java spec requires that a numeric literal that doesn't fit in its intended type be a compilation error, which could also be encoded as an irrecoverable failure. I'm not sure how to do that in Alex, though. I can see a few ways (there's a small sketch of one possibility after this list):
- encode irrecoverable failure as a special startcode that provides no way back to startcode 0 - an irrecoverable failure sets that special startcode, and every action first checks whether the current startcode is the "failure" one and fails accordingly
- similar to the startcode idea, but use a wrapper that supports user state and record the failure there
- maybe this is another case where not using a wrapper would give me more control, but I don't have a concrete idea for this alternative.
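For what it's worth, if I'm reading the wrapper code right, the monadic wrappers (including monad-bytestring) expose alexError :: String -> Alex a, and the Alex monad is Either-based under the hood, so calling it from an action aborts the whole scan - which is more or less the irrecoverable failure I'm after. Combined with an extra rule that matches the malformed literal in full (so the longest-match rule prefers it over the well-formed octal rule), the "012389" case might look something like this sketch (TokOctal and mkOctalTok are made-up names):

```
tokens :-

  -- well-formed octal literal; mkOctalTok (hypothetical) would build the
  -- token from the lexeme, and could itself call alexError if the value
  -- doesn't fit the intended type
  0 [0-7]+             { \inp len -> mkOctalTok inp len }

  -- malformed octal literal: this matches all of "012389", so by the
  -- longest-match rule it wins over the rule above, and the action makes
  -- the whole scan fail instead of emitting [octal "0123", decimal "89"]
  0 [0-7]* [89] [0-9]* { \_ _ -> alexError "malformed octal literal" }
```

That still wouldn't give me "mark a failure and keep scanning", which is what the startcode / user-state ideas above are about, but for hard errors it seems simpler than either.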
Any thoughts on this are appreciated.