
Hello Simon,

Tuesday, May 17, 2005, 5:30:06 PM, you wrote:
The question is what Alex should see for a unicode character: Alex currently assumes that characters are in the range 0-255 (you need a fixed range in order to generate the lexer tables). One possibility is to map all Unicode upper-case characters to a single character code for Alex, and similarly for the other classes of character.
i don't know anything about Alex internals, and can only say that any solution is better done INSIDE Alex, so that other programs using it will also get Unicode support
SM> The right thing to do as far as Alex is concerned is to collapse the
SM> full Char range onto a smaller number of character classes which are
SM> then lexed using the standard DFA lexer. Alex could figure out the
SM> required character classes automatically.

SM> However, a simpler solution for GHC would be to essentially do this by
SM> hand, since we already know what the character classes for Haskell are
SM> (upper case, lower case, digit etc.), and we already have some code that
SM> determines character classes for Unicode characters (GHC.Unicode). So
SM> for example you map upper-case unicode character onto 0xfe, lower-case
SM> onto 0xfd, and so on.

imho this can be done inside Alex as a universal solution for all programs: divide all >127 chars into just a few classes (upper, lower, other letters, spaces, special chars) and map them to 0xfe, 0xfd and so on, as you suggest. it will work for a large number of programs which do not pay special attention to individual >127 chars
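to make the idea concrete, here is a rough sketch of such a mapping. it is only an illustration, not Alex or GHC code: the name charToLexByte and the codes 0xfc, 0xfb and 0xfa are my own assumptions (only 0xfe and 0xfd are mentioned above), and it uses the Data.Char predicates instead of GHC.Unicode directly:

```haskell
import Data.Char (isAlpha, isLower, isSpace, isUpper, ord)
import Data.Word (Word8)

-- Class codes handed to the lexer. Only 0xfe (upper) and 0xfd (lower)
-- come from the discussion; the other three are assumed for illustration.
upperCode, lowerCode, otherLetterCode, spaceCode, specialCode :: Word8
upperCode       = 0xfe
lowerCode       = 0xfd
otherLetterCode = 0xfc
spaceCode       = 0xfb
specialCode     = 0xfa

-- | Hypothetical helper: collapse a full Unicode Char onto the single
-- byte that a 0-255 table-driven lexer would see.
charToLexByte :: Char -> Word8
charToLexByte c
  | ord c < 128 = fromIntegral (ord c)  -- ASCII passes through unchanged
  | isUpper c   = upperCode             -- upper-case (or title-case) letter
  | isLower c   = lowerCode             -- lower-case letter
  | isAlpha c   = otherLetterCode       -- letters without case
  | isSpace c   = spaceCode             -- non-ASCII white space
  | otherwise   = specialCode           -- everything else
```

the lexer rules would then mention 0xfe wherever they accept an upper-case letter, and so on. the original Char still has to be kept next to the class byte, so that the token text can be rebuilt after lexing.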
btw, Ruby supports writing numbers in form 1_200_000. how about adding this feature to GHC? ;)
SM> I'm not keen on that. We don't tend to introduce features that break
SM> Haskell 98 compatibility unless they're quite compelling

i know that such things are not debatable :) as written in one book, "there is no sense in solving a problem if it is known that the problem has a solution" :)

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com