I've recently been playing with Alex + Happy to try parsing Java (for now I'm only working on the lexer / Alex side, so whenever you see "parse" below, it means tokenize). My reference is the Java SE 15 Spec; the chapter of interest is
Chapter 3. Lexical Structure (side note: my timing is a bit interesting, as I just realized while writing this that the Java SE 16 spec came out just a few days ago, so I might switch to that). Now that I've gotten my hands dirty a bit, I have a few questions and hope someone can shed some light on them:
- For now I'm using the "monad-bytestring" wrapper for performance, but I'm starting to think a String-based wrapper is more appropriate, since it would let me follow steps 1 and 2 of "3.2. Lexical Translations" properly before passing the input to Alex - namely, I can pre-process the input stream to (1) translate Unicode escapes, turning the raw byte stream into a stream of Chars, and (2) normalize line terminators into just \n (there's a rough sketch of these two passes after the next two questions). But:
- Are those two passes (Unicode escape translation and line terminator normalization) possible within the Alex framework?
- Is there any way I can stick with a memory-compact string representation? (I'm not sure why Alex doesn't provide a Text-based wrapper, given that its docs mention it internally works on UTF-8 encoded byte sequences.) I could probably consider not using any wrapper at all, as GHC and Agda do, but that's somewhat undocumented territory, so I'm hesitant to go there.
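Here's roughly what I have in mind for the two pre-lexing passes, done on a plain String before Alex ever sees the input. This is only a sketch (no position tracking, and malformed \u escapes are passed through instead of rejected), but it follows the JLS 3.3 rule that a backslash only starts a Unicode escape when it's preceded by an even number of backslashes:

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Pass 1: translate Unicode escapes (\uXXXX, with one or more u's).
-- Per JLS 3.3, a backslash is only eligible to start an escape when it
-- is preceded by an even number of backslashes, hence the "\\ \\" case.
unicodeEscapes :: String -> String
unicodeEscapes ('\\' : '\\' : rest) = '\\' : '\\' : unicodeEscapes rest
unicodeEscapes ('\\' : 'u' : rest) =
  case span (== 'u') rest of                  -- the spec allows \uuu...XXXX
    (_, h1 : h2 : h3 : h4 : rest')
      | all isHexDigit [h1, h2, h3, h4] ->
          chr (foldl (\acc d -> acc * 16 + digitToInt d) 0 [h1, h2, h3, h4])
            : unicodeEscapes rest'
    _ -> '\\' : 'u' : unicodeEscapes rest     -- malformed; real code should error
unicodeEscapes (c : rest) = c : unicodeEscapes rest
unicodeEscapes [] = []

-- Pass 2: normalize CR and CRLF line terminators to a single LF.
normalizeLineTerminators :: String -> String
normalizeLineTerminators ('\r' : '\n' : rest) = '\n' : normalizeLineTerminators rest
normalizeLineTerminators ('\r' : rest)        = '\n' : normalizeLineTerminators rest
normalizeLineTerminators (c : rest)           = c : normalizeLineTerminators rest
normalizeLineTerminators []                   = []

-- what would actually be handed to Alex
preLex :: String -> String
preLex = normalizeLineTerminators . unicodeEscapes
```

If I end up wanting Text, I suppose the same two passes could be written with Data.Text.uncons, but then I'm back to the wrapper question above.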
- The other trouble I have with "3.2. Lexical Translations" is the special rule for ">": "... There is one exception: if lexical translation occurs in a type context (§4.11) ..." - but how in the world am I supposed to do this? Lexical analysis isn't even finished, so how would I tell whether I'm in a type context (and §4.11 is long enough that I won't even try to read it unless forced)? Maybe I could just tokenize every ">" as an individual operator, as if ">>", ">>=", ">>>", and ">>>=" didn't exist, and worry about it later, but that doesn't sound right to me.
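If I did go the "one token per '>'" route, I imagine the shift operators would have to be reassembled later from adjacent '>' tokens using their source positions, so that "List<List<String>>" still works while "a > > b" is rejected. A rough sketch of what I mean (Tok, Located, and adjacentGT are made-up names; the positions would come from AlexPosn in practice):

```haskell
-- token type, heavily abbreviated
data Tok = TGreater | TIdent String {- ... many more ... -}
  deriving (Eq, Show)

-- a token together with the line and column it starts at
data Located a = L { locLine :: !Int, locCol :: !Int, unLoc :: a }

-- two '>' tokens can be merged into '>>' (by a grammar rule or a
-- post-lexing pass) only when they are directly adjacent in the source
adjacentGT :: Located Tok -> Located Tok -> Bool
adjacentGT a b =
  unLoc a == TGreater && unLoc b == TGreater
    && locLine a == locLine b
    && locCol b == locCol a + 1
```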
- I've realized there's a need for "irrecoverable failure": I have a test case with the octal literal "012389", which is supposed to fail, but Alex happily tokenizes it into [octal "0123", decimal "89"]. For now my workaround is that every numeric-literal action checks whether the previous char is a digit and fails if so, but this feels like duct tape over a flawed approach, and at some point it will break on an edge case. An ideal fix IMO would be some notion of irrecoverable failure: failing to parse a literal in full should be irrecoverable, rather than parsing most of it and moving on. In addition, the Java spec requires that a numeric literal that doesn't fit in its intended type be a compilation error, which could also be encoded as an irrecoverable failure. I'm not sure how to do that in Alex, though. I can see a few ways (there's a small sketch of one possibility after this list):
- encode irrecoverable failure as a special startcode that provides no way back to startcode 0 - an irrecoverable failure sets that special startcode, and every action first checks whether the current startcode is the "failure" one and fails accordingly
- similar to the startcode idea, but use a wrapper that supports user state and record the failure there
- maybe this is another case where not using a wrapper would give me more control, but I don't have a concrete idea for this alternative.
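For what it's worth, if I'm reading the wrapper code right, the monadic wrappers (including monad-bytestring) expose alexError :: String -> Alex a, and the Alex monad is Either-based under the hood, so calling it from an action aborts the whole scan - which is more or less the irrecoverable failure I'm after. Combined with an extra rule that matches the malformed literal in full (so the longest-match rule prefers it over the well-formed octal rule), the "012389" case might look something like this sketch (TokOctal and mkOctalTok are made-up names):

```
tokens :-

  -- well-formed octal literal; mkOctalTok (hypothetical) would build the
  -- token from the lexeme, and could itself call alexError if the value
  -- doesn't fit the intended type
  0 [0-7]+             { \inp len -> mkOctalTok inp len }

  -- malformed octal literal: this matches all of "012389", so by the
  -- longest-match rule it wins over the rule above, and the action makes
  -- the whole scan fail instead of emitting [octal "0123", decimal "89"]
  0 [0-7]* [89] [0-9]* { \_ _ -> alexError "malformed octal literal" }
```

That still wouldn't give me "mark a failure and keep scanning", which is what the startcode / user-state ideas above are about, but for hard errors it seems simpler than either.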
Any thoughts on this are appreciated.