Re: [Haskell-cafe] Hugs vs GHC (again) was: Re: Some random newbie questions

-------- Original Message --------
Subject: Re: [Haskell-cafe] Hugs vs GHC (again) was: Re: Some random newbie questions
Date: Mon, 10 Jan 2005 20:47:26 -0500
From: Dimitry Golubovsky
It's not obvious what the predicates should really mean, e.g. should isDigit and isHexDigit include non-ASCII digits or should isSpace include non-breaking space characters.
I think perhaps the answer is all of the above. The functions could be defined in multiple modules, so that 'ASCII.isSpace' would match the "normal" space character only, while 'Unicode.isSpace' could match all the weird and wonderful stuff in the standard.
So there might be a bunch of (perhaps autogenerated, from localedef files) modules for each locale/encoding, like ISO8859_1 or KOI_8. These modules might be imported into applications as needed. Also there would be one module autogenerated from the Unicode data files.

Dimitry Golubovsky
Middletown, CT
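For concreteness, here is a minimal Haskell sketch of the two-module idea; the function names (asciiIsSpace and unicodeIsSpace, standing in for ASCII.isSpace and Unicode.isSpace) and the exact character sets are illustrative assumptions, not an existing API:

    -- A sketch of the two-module idea.  The function names stand in for
    -- ASCII.isSpace and Unicode.isSpace; the character sets are
    -- illustrative, not a definitive list.
    asciiIsSpace :: Char -> Bool
    asciiIsSpace c = c `elem` " \t\n\v\f\r"     -- the "normal" whitespace only

    unicodeIsSpace :: Char -> Bool
    unicodeIsSpace c =
         asciiIsSpace c
      || c `elem` "\x00A0\x2007\x202F"          -- no-break spaces
      || (c >= '\x2000' && c <= '\x200A')       -- en quad .. hair space
      || c `elem` "\x1680\x2028\x2029\x3000"    -- other separator characters

With these definitions, asciiIsSpace '\x00A0' is False while unicodeIsSpace '\x00A0' is True, which is exactly the split described above.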

Dimitry Golubovsky
       |Sebastien's| Marcin's | Hugs
-------+-----------+----------+------
alnum  | L* N*     | L* N*    | L*, M*, N* <1>
alpha  | L*        | L*       | L* <1>
cntrl  | Cc        | Cc Zl Zp | Cc
digit  | N*        | Nd       | '0'..'9'
lower  | Ll        | Ll       | Ll <1>
punct  | P*        | P*       | P*
upper  | Lu        | Lt Lu    | Lu Lt <1>
blank  | Z* \t\n\r | Z*(except| ' ' \t\n\r\f\v U+00A0
       |           | U+00A0   |
       |           | U+2007   |
       |           | U+202F)  |
       |           | \t\n\v\f\r|
       |           | U+0085   |
<1>: for characters outside the Latin-1 range. For Latin-1 characters (0 to 255), there is a lookup table defined as "unsigned char charTable[NUM_LAT1_CHARS];"
If the table coincides with the Unicode character categories, then it's just an implementation detail. I wrote "Cc" for Hugs in place of the test c < ' ' || (c >= '\DEL' && c <= '\x9f') because the two are equivalent.
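For reference, the equivalence can be written out in Haskell, assuming a Data.Char that exposes generalCategory (as GHC's does); both predicates below pick out exactly the Cc characters U+0000..U+001F and U+007F..U+009F:

    import Data.Char (generalCategory, GeneralCategory(Control))

    -- Range test, as in the old Hugs code:
    isControlByRange :: Char -> Bool
    isControlByRange c = c < ' ' || (c >= '\DEL' && c <= '\x9f')

    -- Category test ("Cc"):
    isControlByCategory :: Char -> Bool
    isControlByCategory c = generalCategory c == Control

    -- Cc is exactly U+0000..U+001F and U+007F..U+009F, so the two
    -- predicates agree on every character, not just on Latin-1.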
So there might be a bunch of (perhaps autogenerated, from localedef files) modules for each locale/encoding, like ISO8859_1 or KOI_8.
I disagree. Char is supposed to mean Unicode only, and data is converted to Unicode on the boundaries with those parts of the world which use different encodings.

With Unicode in mind it still makes sense to talk about digits as '0'..'9' only; most programming languages specify numeric literals as consisting of these digits only. Other contexts may require a wider set, including today's Arabic digits etc. This is not because of the encoding but because of the intended set of characters.

One reason why the predicates are not obvious is that as the features encodable as text become more sophisticated, old algorithms for handling text become limited. For example, if an identifier is specified as a letter followed by a sequence of letters or numbers, then combining marks are not allowed in identifiers, even though the corresponding precomposed characters are allowed. I guess this is why Hugs includes M* in isAlphaNum. This is a lie which allows old code to work better: these characters are not alphanumeric; it's the definition of identifiers which is no longer appropriate. (Unicode recommends a particular definition of identifiers for programming languages which want to permit non-ASCII identifiers; it has various exceptions because it's intended to be somewhat compatible with older versions of itself.)

Another case where the old interfaces are not sufficient is toUpper & toLower. They should be defined on strings, not characters. Besides 'ß' there are other characters which uppercase or lowercase to several code points: ligatures, precomposed characters which lack a precomposed variant in the other case but can be decomposed, the Greek iota below which is specified to uppercase to a separate iota after the letter (some people believe this is wrong, but it's how it's currently specified in Unicode), and some cases with accents over I and i. Case mapping is also context-dependent for sigma.

-- 
   __("<         Marcin Kowalczyk
   \__/        qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
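To illustrate the point about string-valued case mappings, a minimal Haskell sketch; the function name and the tiny mapping table are made up for illustration and are not the library API:

    import qualified Data.Char as C

    -- Sketch of a string-level uppercase: one character may expand to
    -- several code points, which a Char -> Char toUpper cannot express.
    upcaseString :: String -> String
    upcaseString = concatMap up
      where
        up '\x00DF' = "SS"            -- 'ß' has no single-character uppercase
        up '\xFB00' = "FF"            -- the 'ff' ligature, likewise
        up c        = [C.toUpper c]   -- fall back to the per-character mapping

With this, upcaseString "straße" yields "STRASSE", something the per-character toUpper leaves as "STRAßE".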

Marcin 'Qrczak' Kowalczyk
Dimitry Golubovsky writes:
[Proposal: ASCII.isDigit is true for '0'..'9', Unicode.isDigit is true for whatever Unicode defines as digits]
So there might be a bunch of (perhaps autogenerated, from localedef files) modules for each locale/encoding, like ISO8859_1 or KOI_8.
I disagree. Char is supposed to mean Unicode only, and data is converted to Unicode on boundaries with those parts of the world which use different encodings.
...and the uppercase chars in KOI_8 are a subset of the uppercase chars in Unicode, so a KOI_8-specific isUpper would be superfluous(?)

My intention was only to distinguish between (the traditional) ASCII and (our modern-day tower of Babel) Unicode. One possibility could be to have locale modules apply to (raw) Word8 data -- so somebody writing for KOI_8 could avoid converting to Unicode Char at all. I'm not sure this is something we want, though.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
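A minimal sketch of what such a byte-level locale module might look like; the function name is hypothetical and the code points are recalled from the KOI8-R layout (uppercase Cyrillic at 0xE0..0xFF, Ё at 0xB3), so they are worth double-checking:

    import Data.Word (Word8)

    -- Hypothetical byte-level predicate for KOI8-R; no conversion to Char.
    koi8IsUpper :: Word8 -> Bool
    koi8IsUpper b =
         (b >= 0x41 && b <= 0x5A)   -- ASCII 'A'..'Z'
      || b >= 0xE0                  -- uppercase Cyrillic (0xE0..0xFF in KOI8-R)
      || b == 0xB3                  -- uppercase Ё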