
On 07/01/14 14:38, Simon Marlow wrote:
Krasimir is right, it would be hard to use Alex's built-in Unicode support because we have to automatically generate the character classes from the Unicode spec somehow. Probably Alex ought to include these as built-in macros, but right now it doesn't.
Even if we did have access to the right regular expressions, I'm slightly concerned that the generated state machine might be enormous.
Cheers, Simon
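Simon's point about generating the classes automatically can be illustrated with a small, hypothetical Haskell program (not part of Alex or GHC, and not how either actually does it): it enumerates the Char range, filters it with a Data.Char predicate, collapses the survivors into contiguous ranges, and prints them in Alex's set-macro syntax. The macro name and layout below are made up for illustration.

    -- Sketch only: derive an Alex character-set macro from a Data.Char predicate.
    import Data.Char (isAlpha, ord)
    import Text.Printf (printf)

    -- Collapse a sorted list of characters into inclusive (lo, hi) ranges.
    ranges :: [Char] -> [(Char, Char)]
    ranges []     = []
    ranges (c:cs) = go c c cs
      where
        go lo hi [] = [(lo, hi)]
        go lo hi (x:xs)
          | ord x == ord hi + 1 = go lo x xs
          | otherwise           = (lo, hi) : go x x xs

    -- Render the ranges in Alex's set syntax, e.g. [\x0041-\x005a \x0061-\x007a ...].
    alexMacro :: String -> (Char -> Bool) -> String
    alexMacro name p =
      "$" ++ name ++ " = ["
        ++ unwords (map render (ranges (filter p [minBound .. maxBound])))
        ++ "]"
      where
        render (lo, hi)
          | lo == hi  = printf "\\x%04x" (ord lo)
          | otherwise = printf "\\x%04x-\\x%04x" (ord lo) (ord hi)

    main :: IO ()
    main = putStrLn (alexMacro "uniAlpha" isAlpha)

A generator along these lines would make the resulting macros only as current as the Unicode tables shipped with the compiler's base library, and, as Simon notes, the resulting state machine could still be very large.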
On 07/01/2014 08:26, Krasimir Angelov wrote:
Hi,
I was recently looking at this code to see how the lexer decides that a character is a letter, a space, etc. The problem is that with Unicode there are hundreds of thousands of characters that are declared to be alphanumeric. Even if they are compressed into a regular expression as a list of ranges, there will still be ~390 ranges. The GHC lexer avoids hardcoding these ranges by calling isSpace, isAlpha, etc. and then converting the result to a code (a sketch of this workaround follows below). Ideally Alex would have predefined macros corresponding to the Unicode categories, but for now you have to either hardcode the ranges with huge regular expressions or use the workaround used in GHC. Is there any other solution?
Regards, Krasimir
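For reference, here is a minimal sketch of the kind of workaround Krasimir describes: instead of teaching the automaton the full Unicode ranges, each non-ASCII character is classified with Data.Char at lexing time and replaced by a single stand-in byte per class, so the grammar only ever has to mention a handful of codes. The class names and byte values below are illustrative assumptions, not GHC's actual encoding.

    -- Sketch only: map characters to a small alphabet the lexer's DFA can consume.
    import Data.Char
    import Data.Word (Word8)

    -- Stand-in bytes for whole Unicode categories; ASCII passes through unchanged.
    uniLarge, uniSmall, uniDigit, uniSymbol, uniSpace, uniOther :: Word8
    uniLarge  = 0xf0
    uniSmall  = 0xf1
    uniDigit  = 0xf2
    uniSymbol = 0xf3
    uniSpace  = 0xf4
    uniOther  = 0xf5

    -- Classify one character; only these few codes ever reach the grammar
    -- for non-ASCII input, so no Unicode ranges need to be spelled out in it.
    classify :: Char -> Word8
    classify c
      | c <= '\x7f' = fromIntegral (ord c)          -- plain ASCII: keep the real byte
      | isUpper c   = uniLarge
      | isLower c   = uniSmall
      | isDigit c   = uniDigit
      | isSpace c   = uniSpace
      | isSymbol c || isPunctuation c = uniSymbol
      | otherwise   = uniOther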
Ah, I think I understand now. If that is the case, at least ‘alexGetChar’ could be removed, right? Is Alex 2.x compatibility necessary for any reason whatsoever? -- Mateusz K.