
Hi Cafe!

First of all, sorry for the noise if you have already seen this on Stack Overflow; I felt it was too much for a single SO question. I've recently been playing with Alex + Happy to try parsing Java (for now I'm only working on the lexer / Alex side, so wherever you see "parse" below, read "tokenize"). My reference is the Java SE 15 spec, and the chapter of interest is Chapter 3, "Lexical Structure": https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html (side note: my timing is a bit interesting, as I just realized while writing this that the Java SE 16 spec came out only a few days ago, so I might switch to that). Now that I've gotten my hands dirty a bit, I have a few questions and hope someone can shed some light on them:

1. For now I'm using the "monad-bytestring" wrapper for performance, but I'm starting to think a String-based wrapper might be more appropriate, since it would let me follow steps 1 and 2 of "3.2. Lexical Translations" properly before the input ever reaches Alex: namely, I could pre-process the input stream to (1) translate Unicode escapes, turning the raw byte stream into a stream of Chars, and (2) normalize line terminators to just \n (a sketch of the pre-pass I have in mind is attached below). But:

   a. Are those two passes (Unicode escape translation and line terminator normalization) possible within the Alex framework itself?

   b. Is there any way I can stick with a memory-compact string representation? (I'm not sure why Alex doesn't provide a Text-based wrapper, given that its documentation mentions it internally works on a UTF-8 encoded byte sequence.) I could consider not using any wrapper at all, as GHC and Agda do, but that is rather undocumented territory, so I'm hesitant to go there.

2. The other trouble I have with "3.2. Lexical Translations" is the special rule applied to ">": "... There is one exception: if lexical translation occurs in a type context (§4.11) ...". How in the world am I supposed to do this? Lexical analysis isn't even done yet, so how am I going to tell whether I'm in a type context (and §4.11 is long enough that I won't try to read it unless forced)? Maybe I could just tokenize every ">" as an individual operator, as if ">>", ">>=", ">>>", and ">>>=" didn't exist, and worry about reassembling them later (see the adjacency sketch below), but that doesn't sound quite right to me.

3. I've realized there is a need for "irrecoverable failure": I have a test case with the octal literal "012389", which is supposed to be rejected, but Alex happily tokenizes it into [octal "0123", decimal "89"]. My current workaround is to have every numeric-literal action check whether the previous character is a digit and fail if so, but this feels like duct tape over a flawed approach, and at some point it will break on an edge case. An ideal fix, IMO, would be some notion of irrecoverable failure: failing to lex a literal in full should be irrecoverable, rather than lexing most of it and moving on. In addition, the Java spec requires that a numeric literal that doesn't fit its intended type is a compile-time error, which could be encoded as an irrecoverable failure as well. I'm not sure how to do that in Alex, though. I can see a few options (one concrete direction is sketched below):

   a. Encode irrecoverable failure with a special startcode that offers no way back to startcode 0: an irrecoverable failure sets that startcode, and every action begins by checking whether the current startcode is the special "failure" one and fails accordingly.

   b. Similar to the startcode idea, but use a wrapper that supports a user state.

   c. Maybe this is another case where not using a wrapper would give me more control, but I don't have a concrete idea for this alternative.
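To make question 1 concrete, here is a rough sketch (all names are my own) of the kind of pre-pass I mean, written against String for simplicity; error handling for a stray "\u" that isn't followed by four hex digits is elided:

    import Data.Char (chr, digitToInt, isHexDigit)

    -- JLS 3.3: "\" followed by one or more "u"s and four hex digits is a
    -- Unicode escape, but only when that "\" is preceded by an even number
    -- of backslashes. Consuming a non-escape backslash together with the
    -- character it protects keeps that invariant without a counter.
    translateUnicodeEscapes :: String -> String
    translateUnicodeEscapes = go
      where
        go ('\\' : rest)
          | (us, afterUs) <- span (== 'u') rest
          , not (null us)
          , (ds, rest') <- splitAt 4 afterUs
          , length ds == 4
          , all isHexDigit ds
          = chr (foldl (\acc d -> acc * 16 + digitToInt d) 0 ds) : go rest'
        -- a backslash that does not start an escape "protects" the next
        -- character, so e.g. "\\u0041" is left untouched
        go ('\\' : c : rest) = '\\' : c : go rest
        go (c : rest)        = c : go rest
        go []                = []

    -- JLS 3.4: normalize CR LF and lone CR to a single LF.
    normalizeLineTerminators :: String -> String
    normalizeLineTerminators ('\r' : '\n' : rest) = '\n' : normalizeLineTerminators rest
    normalizeLineTerminators ('\r' : rest)        = '\n' : normalizeLineTerminators rest
    normalizeLineTerminators (c : rest)           = c : normalizeLineTerminators rest
    normalizeLineTerminators []                   = []

    -- what I would run before handing the result to Alex
    preLex :: String -> String
    preLex = normalizeLineTerminators . translateUnicodeEscapes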
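For question 2, if I do go the "only ever emit a single '>'" route, I imagine the parser (or a small pass over the token stream, wherever a shift operator is actually expected) could glue adjacent '>' tokens back together using their source positions, so the "type context" decision never has to live in the lexer. A tiny sketch of that adjacency check, with a made-up token type of my own:

    -- made-up token type; only the '>' case matters here
    data Tok = TokGT | TokOther String
      deriving (Eq, Show)

    data PosTok = PosTok
      { tokLine :: Int
      , tokCol  :: Int
      , tokBody :: Tok
      } deriving (Show)

    -- two '>' tokens can only form part of ">>" if nothing at all (not
    -- even whitespace or a comment) separates them in the source
    adjacentGT :: PosTok -> PosTok -> Bool
    adjacentGT a b =
      tokBody a == TokGT
        && tokBody b == TokGT
        && tokLine a == tokLine b
        && tokCol  b == tokCol a + 1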
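For question 3, one direction I'm considering (somewhat orthogonal to the startcode/user-state ideas above): as far as I can tell, the monad-style wrappers already give me irrecoverable failure through alexError, since runAlex returns Either String a and an action that calls alexError aborts the whole run. The remaining problem is getting Alex to reach such an action at all, because maximal munch happily stops after "0123". An explicit rule for the malformed literal is longer than the well-formed one, so it wins, and its action can then fail for good. A sketch of the rules (mkTok and TokOctal are placeholders of mine):

    $digit    = [0-9]
    $octDigit = [0-7]

    tokens :-

      0 $octDigit+                { mkTok TokOctal }
      -- explicitly match the malformed case: being the longer match it
      -- beats the rule above, and alexError aborts the entire run
      0 $octDigit* [8-9] $digit*  { \_ _ -> alexError "malformed octal literal" }

The "literal doesn't fit the intended type" check could live in the same kind of action, calling alexError after bounds-checking the parsed value.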
Any thoughts on these are appreciated.

Thanks!
Javran (Fang) Cheng