Re: [Haskell-cafe] Some Alex beginner questions regarding parsing Java

18 Mar 2021

      On Wed, Mar 17, 2021 at 9:02 PM Javran Cheng  wrote:
...
Few days after I wrote the OP, I realized I had some misconceptions and
now I can answer few of my own questions (I still don't get the second one
though):
On Mon, Mar 15, 2021 at 3:42 PM Javran Cheng  wrote:
...
Hi Cafe!
Firstly, sorry for spamming if you already see this on Stack Overflow, I
do feel this is too much for a single SO question.
I'm recently playing with Alex + Happy to try parsing Java (for now I'm
only working on the lexer / Alex, so whenever you see "parse" below, that
means tokenize). My reference is Java SE 15 Spec, the chapter of interest
is Chapter 3. Lexical Structure
https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html (side
note: my timing is a bit interesting as I just realized as of writing this,
Java SE 16 spec is out just few days ago, so I might switch to that) and
now that I've get my hand dirty a bit, I have few questions and hope if
someone can shed some light on them:
1. For now I'm using "monad-bytestring" wrapper for performance, but
   now I think maybe String-based wrapper is more appropriate, for it allows
   me to follow 1. and 2. in "3.2. Lexical Translations" properly before
   passing the input to Alex - namely I can pre-process input stream to (1) do
   Unicode escaping to turn the raw byte flow into a flow of Chars (2) I can
   normalize line terminators into just \n. But:
      1. Are those two passes (Unicode escape and line terminator
      normalization) possible within Alex framework?
      2. Is there any way that I can stick with a memory-compact string
      representation? (not sure why Alex doesn't provide any Text-based wrapper,
      as it did mention in its doc that it internally works on UTF-8 encoded byte
      sequence) I could probably consider to not use any wrapper, as GHC and Agda
      did, but that's sort of an undocumented territory so I'm hesitated to do so.
I'm actually less worried about String-based parsing now - I'm not really
keeping a chunk of String and moving it around like data, I'm consuming it,
and this is fine as long as backtracking is minimized.
And for whatever reason I totally missed "5.2 Basic interface" in Alex's
user manual, which answers my "how to use Alex without wrapper" question -
I can just run alex command and copy what's necessary from generated source
code - now I know I just need the definition of AlexInput, alexGetByte,
alexInputPrevChar and whatever transitively required.
...
1. The other trouble I have regarding "3.2. Lexical Translations" is
   the special rules applied to ">"s: "... There is one exception: if
   lexical translation occurs in a type context (§4.11) ..." - but how in the
   world am I going to do this? I mean the lexical analysis is not even done
   how am I going to tell whether it's a type context (and §4.11 is quite long
   that I won't even try to read it, unless forced)? Maybe I can just tokenize
   every ">" as an individual operatior, as if ">>", ">>=", ">>>", and ">>>="
   don't exist and worry about that later, but that doesn't sound right to me.
This one I still don't get - any help is appreciated.
...
1. I realize there's a need for "irrecoverable failure": I have a
   test case with octal literal "012389", which is supposed to fail, but
   Alex happily tokenized that into [octal "0123", decimal "89"] - for
   now my workaround is for every number literal to check whether previous
   char is a digit and fail if it is indeed so, but I feel this is like
   ducktaping on a flawed approach and at some point it will fail on some edge
   cases. an ideal fix IMO would be to have some notion of irrecoverable
   failure - failing to parse a literal at full should be irrecoverable rather
   than trying to parse most of it and move on. In addition, as Java spec
   requires that if a numeric literal doesn't fit in the intended type, it's a
   compilation error - which can also be encoded as an irrecoverable failure
   as well. I'm not sure how to do that in Alex though, I can see few ways:
      1. encode irrecoverable failure by setting to a special startcode,
      which does not provide anyway to go back to startcode 0 - so an
      irrecoverable failure sets that special startcode, and at the start of
      every action, it checks whether startcode is the special "failure"
      startcode and fail accordingly
      2. this is similar to startcode, but use a wrapper that supports
      userstate.
      3. maybe this is another case that not using a wrapper would give
      me more control, but I don't have a concrete idea regarding this
      alternative.
This is another thing that I totally misunderstood: I wrote the Alex rule
for recognizing octals as "0(0-7)+", of course only "0123" bit of "012389"
matches! Instead I should probably just accept a boarder pattern and deal
with them in Alex actions (This is similar to just do regex "(\d+\.){3}\d+"
and check whether numbers are in 0~255 range for parsing and validating
IPv4 address), this way allow me to throw informative errors. And Alex is
actually capable of doing "irrecoverable failures", as it's just a newtype
around a state function that returns "Either String _". Although I'd like
the error type to be finer than String, which now I'm happy to just roll my
own without alex wrappers.
This sortah rule is akin to something agda does, roughly , which is treat
stuff as a single token unless there’s a white space separator.

For that matter, agdas parser may be a good example of a pretty fancy use
of happy and Alex. Though it itself was originally based on the happy/Alex
parsers in ghc.

Any thoughts on this is appreciated.
...
...
Thanks!
Javran (Fang) Cheng
Cheers,
--
Javran (Fang) Cheng
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: [Haskell-cafe] Some Alex beginner questions regarding parsing Java

Carter Schonwald