RE: Unicode in GHC: need more advice

On 14 January 2005 12:58, Dimitry Golubovsky wrote:
Now I need more advice on which "flavor" of Unicode support to implement. In Haskell-cafe, three flavors were summarized; I am reposting the table here (in its latest version).
       | Sebastien's | Marcin's    | Hugs
-------+-------------+-------------+----------------
alnum  | L* N*       | L* N*       | L*, M*, N* <1>
alpha  | L*          | L*          | L* <1>
cntrl  | Cc          | Cc Zl Zp    | Cc
digit  | N*          | Nd          | '0'..'9'
lower  | Ll          | Ll          | Ll <1>
punct  | P*          | P*          | P*
upper  | Lu          | Lt Lu       | Lu Lt <1>
blank  | Z* \t\n\r   | Z* (except  | ' ' \t\n\r\f\v
       | U+00A0      | U+00A0      |
       |             | U+2007      |
       |             | U+202F)     |
       |             | \t\n\v\f\r  |
       |             | U+0085      |
<1>: for characters outside the Latin1 range. For Latin1 characters (0 to 255), there is a lookup table defined as "unsigned char charTable[NUM_LAT1_CHARS];"
I did not post the contents of the table Hugs uses for the Latin1 part. However, with that table completely removed, Hugs did not work properly. So its contents somehow differ from what Unicode defines for that character range. If needed, I may decode that table and post its mapping of character categories (keeping in mind that those are Haskell-recognized character categories, not Unicode ones).
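For comparison, here is a minimal sketch (assuming a Data.Char that exposes generalCategory, as current base does) that dumps the Unicode category of every Latin1 code point; its output could then be diffed against whatever the Hugs charTable encodes:

    import Data.Char (chr, generalCategory)

    -- Print the Unicode general category of every Latin1 code point
    -- (0..255), one per line, so it can be compared with the
    -- categories baked into Hugs' charTable.
    main :: IO ()
    main = mapM_ report [0 .. 255]
      where
        report :: Int -> IO ()
        report n =
          putStrLn (show n ++ "\t" ++ show (chr n) ++ "\t"
                           ++ show (generalCategory (chr n)))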
I don't know enough to comment on which of the above flavours is best. However, I'd prefer not to use a separate table for Latin-1 characters if possible. We should probably stick to the Report definitions for isDigit and isSpace, but we could add a separate isUniDigit/isUniSpace for the full Unicode classes.
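Roughly, a minimal sketch of that split (isUniDigit is just the name suggested above, not an existing function, and the Unicode side is approximated with Data.Char's generalCategory):

    import Data.Char (GeneralCategory(..), generalCategory)

    -- Report-style isDigit: ASCII decimal digits only.
    isDigit' :: Char -> Bool
    isDigit' c = c >= '0' && c <= '9'

    -- Hypothetical full-Unicode variant: any character in the
    -- DecimalNumber (Nd) category.
    isUniDigit :: Char -> Bool
    isUniDigit c = generalCategory c == DecimalNumber

An isUniSpace could be defined along the same lines using the Zs/Zl/Zp categories.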
One more question that I had when experimenting with Hugs: if a character (like those extra blank chars) is forced into some category for the purposes of Haskell language compilation (per the Report), does this mean that any other Haskell application should recognize the Haskell-defined category of that character rather than the Unicode-defined one?
For Hugs, there was no choice but to say Yes, because both the compiler and the interpreter used the same code to decide on a character's category. In GHC this may be different.
To be specific: the Report requires that the Haskell lexical class of space characters includes Unicode spaces, but that the implementation of isSpace only recognises Latin-1 spaces. That means we need two separate classes of space characters (or just use the report definition of isSpace). GHC's parser doesn't currently use the Data.Char character class predicates, but at some point we will want to parse Unicode so we'll need appropriate class predicates then.
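To illustrate, a rough sketch of the two space classes (reportIsSpace approximates the Report's Latin-1-only library definition, and lexerIsSpace adds the Report's uniWhite production, approximated here via generalCategory):

    import Data.Char (GeneralCategory(..), generalCategory)

    -- Roughly the Report's library definition: Latin-1 white space only.
    reportIsSpace :: Char -> Bool
    reportIsSpace c = c `elem` " \t\n\r\f\v\xa0"

    -- White space as the lexical syntax needs it: the usual control
    -- characters plus any character in a Unicode space category
    -- (the Report's uniWhite).
    lexerIsSpace :: Char -> Bool
    lexerIsSpace c =
      c `elem` " \t\n\v\f\r" ||
      generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

    -- U+2028 LINE SEPARATOR: white space to the lexer, but outside
    -- the Latin-1-only class.
    example :: (Bool, Bool)
    example = (lexerIsSpace '\x2028', reportIsSpace '\x2028')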
Since Hugs got there first, does it make sense just to follow what was done there, or will a different decision be adopted for GHC: say, for the parser, the extra characters are forced to be blank, but for the rest of the programs compiled by GHC, the Unicode definitions are adhered to?
Does what I said above help answer this question?

Cheers,
Simon