Re[4]: Unicode source files

17 May 2005

      Hello Simon,

Tuesday, May 17, 2005, 5:30:06 PM, you wrote:
...
...
...
The question is what Alex should see for a Unicode character: Alex
currently assumes that characters are in the range 0-255 (you need a
fixed range in order to generate the lexer tables).  One possibility
is to map all Unicode upper-case characters to a single character
code for Alex, and similarly for the other classes of character.

i don't know anything about Alex internals, and can only say that any
solution is better done INSIDE Alex, so that other programs using it
will also get Unicode support

SM> The right thing to do as far as Alex is concerned is to collapse the
SM> full Char range onto a smaller number of character classes which are
SM> then lexed using the standard DFA lexer.  Alex could figure out the
SM> required character classes automatically.

SM> However, a simpler solution for GHC would be to essentially do this by
SM> hand, since we already know what the character classes for Haskell are
SM> (upper case, lower case, digit etc.), and we already have some code that
SM> determines character classes for Unicode characters (GHC.Unicode).  So
SM> for example you map upper-case Unicode characters onto 0xfe, lower-case
SM> onto 0xfd, and so on.

imho this can be done inside Alex as a universal solution for all
programs: divide all chars >127 into just a few classes (upper,
lower, other letters, spaces, special chars) and map them to 0xfe, 0xfd
and so on, as you suggest. it will work for the large number of programs
that pay no special attention to individual >127 chars
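
to make the idea concrete, here is a rough sketch of such a mapping in
Haskell. it is only an illustration, not how Alex or GHC actually do it:
the function name classify and every code other than 0xfe and 0xfd are
made up for the example, and it relies only on the standard Data.Char
predicates. chars below 128 pass through unchanged, everything else
collapses onto one byte per class, so the lexer tables stay in 0-255.

```haskell
import Data.Char (isAlpha, isLower, isPunctuation, isSpace, isSymbol, isUpper, ord)
import Data.Word (Word8)

-- Placeholder byte codes, one per character class.
-- Only 0xfe and 0xfd come from the discussion; the rest are assumptions.
upperCode, lowerCode, otherLetterCode, spaceCode, specialCode, otherCode :: Word8
upperCode       = 0xfe
lowerCode       = 0xfd
otherLetterCode = 0xfc
spaceCode       = 0xfb
specialCode     = 0xfa
otherCode       = 0xf9

-- | Collapse a Char onto the byte that a table-driven lexer would see.
-- ASCII is passed through as-is; every char >127 is replaced by the
-- code of its class.
classify :: Char -> Word8
classify c
  | ord c < 128                   = fromIntegral (ord c)
  | isUpper c                     = upperCode
  | isLower c                     = lowerCode
  | isAlpha c                     = otherLetterCode  -- letters with no case
  | isSpace c                     = spaceCode
  | isSymbol c || isPunctuation c = specialCode
  | otherwise                     = otherCode
```

for example, classify 'a' stays 97, while classify '\x3A9' (Greek capital
omega) becomes 0xfe. the lexer itself would run on map classify input and
keep the original chars around for building the token text.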
...
...
btw, Ruby supports writing numbers in the form 1_200_000. how about adding
this feature to GHC? ;)

SM> I'm not keen on that.  We don't tend to introduce features that break
SM> Haskell 98 compatibility unless they're quite compelling

i know that such things are not debatable :)  as written in one
book, "there is no sense in solving a problem if it is known
that the problem has a solution" :)

-- 
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

Bulat Ziganshin