RE: Unicode in GHC: need more advice

On 14 January 2005 12:58, Dimitry Golubovsky wrote:
Now I need more advice on which "flavor" of Unicode support to implement. In Haskell-cafe, three flavors were summarized; I am reposting the table here (in its latest version).
       | Sebastien's | Marcin's    | Hugs
-------+-------------+-------------+----------------
alnum  | L* N*       | L* N*       | L*, M*, N* <1>
alpha  | L*          | L*          | L* <1>
cntrl  | Cc          | Cc Zl Zp    | Cc
digit  | N*          | Nd          | '0'..'9'
lower  | Ll          | Ll          | Ll <1>
punct  | P*          | P*          | P*
upper  | Lu          | Lt Lu       | Lu Lt <1>
blank  | Z* \t\n\r   | Z* (except  | ' ' \t\n\r\f\v
       | U+00A0      | U+00A0      |
       |             | U+2007      |
       |             | U+202F)     |
       |             | \t\n\v\f\r  |
       |             | U+0085      |
<1>: for characters outside the Latin1 range. For Latin1 characters (0 to 255), there is a lookup table defined as "unsigned char charTable[NUM_LAT1_CHARS];"
I did not post the contents of the table Hugs uses for the Latin1 part. However, with that table completely removed, Hugs did not work properly. So its contents somehow differ from what Unicode defines for that character range. If needed, I may decode that table and post its mapping of character categories (keeping in mind that those are Haskell-recognized character categories, not Unicode ones).
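For comparison, here is a minimal sketch (assuming a Data.Char that exposes generalCategory, as current base does) that dumps the Unicode category of every Latin1 code point; its output could then be diffed against whatever the Hugs charTable encodes:

    import Data.Char (chr, generalCategory)

    -- Print the Unicode general category of every Latin1 code point
    -- (0..255), one per line, so it can be compared with the
    -- categories baked into Hugs' charTable.
    main :: IO ()
    main = mapM_ report [0 .. 255]
      where
        report :: Int -> IO ()
        report n =
          putStrLn (show n ++ "\t" ++ show (chr n) ++ "\t"
                           ++ show (generalCategory (chr n)))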
I don't know enough to comment on which of the above flavours is best. However, I'd prefer not to use a separate table for Latin-1 characters if possible. We should probably stick to the Report definitions for isDigit and isSpace, but we could add a separate isUniDigit/isUniSpace for the full Unicode classes.
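Roughly, a minimal sketch of that split (isUniDigit is just the name suggested above, not an existing function, and the Unicode side is approximated with Data.Char's generalCategory):

    import Data.Char (GeneralCategory(..), generalCategory)

    -- Report-style isDigit: ASCII decimal digits only.
    isDigit' :: Char -> Bool
    isDigit' c = c >= '0' && c <= '9'

    -- Hypothetical full-Unicode variant: any character in the
    -- DecimalNumber (Nd) category.
    isUniDigit :: Char -> Bool
    isUniDigit c = generalCategory c == DecimalNumber

An isUniSpace could be defined along the same lines using the Zs/Zl/Zp categories.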
One more question that I had when experimenting with Hugs: if a character (like those extra blank chars) is forced into some category for the purposes of Haskell language compilation (per the Report), does this mean that any other Haskell application should recognize the Haskell-defined category of that character rather than the Unicode-defined one?
For Hugs, there was no choice but to say Yes, because both the compiler and the interpreter used the same code to decide on a character's category. In GHC this may be different.
To be specific: the Report requires that the Haskell lexical class of space characters includes Unicode spaces, but that the implementation of isSpace only recognises Latin-1 spaces. That means we need two separate classes of space characters (or just use the report definition of isSpace). GHC's parser doesn't currently use the Data.Char character class predicates, but at some point we will want to parse Unicode so we'll need appropriate class predicates then.
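To illustrate, a rough sketch of the two space classes (reportIsSpace approximates the Report's Latin-1-only library definition, and lexerIsSpace adds the Report's uniWhite production, approximated here via generalCategory):

    import Data.Char (GeneralCategory(..), generalCategory)

    -- Roughly the Report's library definition: Latin-1 white space only.
    reportIsSpace :: Char -> Bool
    reportIsSpace c = c `elem` " \t\n\r\f\v\xa0"

    -- White space as the lexical syntax needs it: the usual control
    -- characters plus any character in a Unicode space category
    -- (the Report's uniWhite).
    lexerIsSpace :: Char -> Bool
    lexerIsSpace c =
      c `elem` " \t\n\v\f\r" ||
      generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

    -- U+2028 LINE SEPARATOR: white space to the lexer, but outside
    -- the Latin-1-only class.
    example :: (Bool, Bool)
    example = (lexerIsSpace '\x2028', reportIsSpace '\x2028')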
Since Hugs got there first, does it make sense just to follow what was done there, or will a different decision be adopted for GHC: say, for the parser, the extra characters are forced to be blank, but for the rest of the programs compiled by GHC, the Unicode definitions are adhered to?
Does what I said above help answer this question?

Cheers,
Simon