
On Wed, Mar 17, 2021 at 9:02 PM Javran Cheng
Few days after I wrote the OP, I realized I had some misconceptions and now I can answer few of my own questions (I still don't get the second one though):
On Mon, Mar 15, 2021 at 3:42 PM Javran Cheng
wrote: Hi Cafe!
Firstly, sorry for spamming if you already see this on Stack Overflow, I do feel this is too much for a single SO question.
I'm recently playing with Alex + Happy to try parsing Java (for now I'm only working on the lexer / Alex, so whenever you see "parse" below, that means tokenize). My reference is Java SE 15 Spec, the chapter of interest is Chapter 3. Lexical Structure https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html (side note: my timing is a bit interesting as I just realized as of writing this, Java SE 16 spec is out just few days ago, so I might switch to that) and now that I've get my hand dirty a bit, I have few questions and hope if someone can shed some light on them:
1. For now I'm using "monad-bytestring" wrapper for performance, but now I think maybe String-based wrapper is more appropriate, for it allows me to follow 1. and 2. in "3.2. Lexical Translations" properly before passing the input to Alex - namely I can pre-process input stream to (1) do Unicode escaping to turn the raw byte flow into a flow of Chars (2) I can normalize line terminators into just \n. But: 1. Are those two passes (Unicode escape and line terminator normalization) possible within Alex framework? 2. Is there any way that I can stick with a memory-compact string representation? (not sure why Alex doesn't provide any Text-based wrapper, as it did mention in its doc that it internally works on UTF-8 encoded byte sequence) I could probably consider to not use any wrapper, as GHC and Agda did, but that's sort of an undocumented territory so I'm hesitated to do so.
I'm actually less worried about String-based parsing now - I'm not really keeping a chunk of String and moving it around like data, I'm consuming it, and this is fine as long as backtracking is minimized. And for whatever reason I totally missed "5.2 Basic interface" in Alex's user manual, which answers my "how to use Alex without wrapper" question - I can just run alex command and copy what's necessary from generated source code - now I know I just need the definition of AlexInput, alexGetByte, alexInputPrevChar and whatever transitively required.
1. The other trouble I have regarding "3.2. Lexical Translations" is the special rules applied to ">"s: "... There is one exception: if lexical translation occurs in a type context (§4.11) ..." - but how in the world am I going to do this? I mean the lexical analysis is not even done how am I going to tell whether it's a type context (and §4.11 is quite long that I won't even try to read it, unless forced)? Maybe I can just tokenize every ">" as an individual operatior, as if ">>", ">>=", ">>>", and ">>>=" don't exist and worry about that later, but that doesn't sound right to me.
This one I still don't get - any help is appreciated.
1. I realize there's a need for "irrecoverable failure": I have a test case with octal literal "012389", which is supposed to fail, but Alex happily tokenized that into [octal "0123", decimal "89"] - for now my workaround is for every number literal to check whether previous char is a digit and fail if it is indeed so, but I feel this is like ducktaping on a flawed approach and at some point it will fail on some edge cases. an ideal fix IMO would be to have some notion of irrecoverable failure - failing to parse a literal at full should be irrecoverable rather than trying to parse most of it and move on. In addition, as Java spec requires that if a numeric literal doesn't fit in the intended type, it's a compilation error - which can also be encoded as an irrecoverable failure as well. I'm not sure how to do that in Alex though, I can see few ways: 1. encode irrecoverable failure by setting to a special startcode, which does not provide anyway to go back to startcode 0 - so an irrecoverable failure sets that special startcode, and at the start of every action, it checks whether startcode is the special "failure" startcode and fail accordingly 2. this is similar to startcode, but use a wrapper that supports userstate. 3. maybe this is another case that not using a wrapper would give me more control, but I don't have a concrete idea regarding this alternative.
This is another thing that I totally misunderstood: I wrote the Alex rule
for recognizing octals as "0(0-7)+", of course only "0123" bit of "012389" matches! Instead I should probably just accept a boarder pattern and deal with them in Alex actions (This is similar to just do regex "(\d+\.){3}\d+" and check whether numbers are in 0~255 range for parsing and validating IPv4 address), this way allow me to throw informative errors. And Alex is actually capable of doing "irrecoverable failures", as it's just a newtype around a state function that returns "Either String _". Although I'd like the error type to be finer than String, which now I'm happy to just roll my own without alex wrappers.
This sortah rule is akin to something agda does, roughly , which is treat stuff as a single token unless there’s a white space separator. For that matter, agdas parser may be a good example of a pretty fancy use of happy and Alex. Though it itself was originally based on the happy/Alex parsers in ghc. Any thoughts on this is appreciated.
Thanks! Javran (Fang) Cheng
Cheers, -- Javran (Fang) Cheng _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.