Re: [Haskell-i18n] Surrogate pairs?

At 2002-08-21 00:17, Ketil Z. Malde wrote:
> #00E1 [LATIN SMALL LETTER A WITH ACUTE]
> or
> #0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
> I guess they must be treated the same, too? That is, the length of the strings should be the same, they should compare equal, etc etc.

In my opinion no. As far as String is concerned, since it is simply [Char], it should be considered as simply a list of codepoints without further interpretation. So 'length' and its instance for Eq should be the same as for any other list.

> Or is it an alternative to just ignore the issue, and simply think of the latter as two characters?

Consider the latter as two codepoints, and don't worry about characters. There should be separate functions for doing such things as decomposition and equivalence.

-- 
Ashley Yakeley, Seattle WA
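
To make the distinction concrete, here is a minimal sketch of the codepoint view of String, with equivalence kept in a separate function. The names composeAAcute and equivalent are made up for this illustration and are not library functions; the composition step knows about this single pair only, whereas real canonical equivalence needs the full Unicode normalization tables.

```haskell
import Data.Char (ord)

-- Precomposed form: the single codepoint U+00E1.
precomposed :: String
precomposed = "\x00E1"

-- Decomposed form: U+0061 followed by U+0301.
decomposed :: String
decomposed = "\x0061\x0301"

-- View a String as the list of codepoints it is.
codepoints :: String -> [Int]
codepoints = map ord

-- Hypothetical helper: compose 'a' + COMBINING ACUTE ACCENT into U+00E1.
-- Only this one pair is handled; everything else is copied unchanged.
composeAAcute :: String -> String
composeAAcute ('\x0061' : '\x0301' : rest) = '\x00E1' : composeAAcute rest
composeAAcute (c : rest) = c : composeAAcute rest
composeAAcute [] = []

-- Hypothetical equivalence test, layered on top of plain list equality.
equivalent :: String -> String -> Bool
equivalent s t = composeAAcute s == composeAAcute t
```

With these definitions, length gives 1 for precomposed and 2 for decomposed, and (==) reports them as different, exactly as for any other list. Only the separate equivalent function treats them as the same.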

I threw out some suggestions on how to encode non-ascii characters in Haskell source code. Did we conclude anything on this?

Looking at the report, we have the following:

| Escape characters for the Unicode character set, including control
| characters such as \^X, are also provided. Numeric escapes such as
| \137 are used to designate the character with decimal representation
| 137; octal (e.g. \o137) and hexadecimal (e.g. \x37) representations
| are also allowed. Numeric escapes that are out-of-range of the
| Unicode standard (16 bits) are an error.

Apparently, this isn't quite supported by GHC:

  Prelude> map Char.ord "\74\749\7490"
  [74,237,66]

which is, of course, the values modulo 256.

Anyway, if the report is corrected to not limit us to 16 bits, this at least gives us enough mechanism to use Unicode in string and character constants. What about using it in identifiers? I suggest the following formats:

  #hhhh and ##hhhhhhhh

for Unicode characters, with the first form being applicable to code points below 64K, and the second to all of Unicode.

(I still think using LaTeXy or HTMLish syntax as synonyms is a good idea, as in

  &oslash; {\alpha;}

if this could be conveniently incorporated in the compilers, but it's probably not crucial. It'd be nice if my .lhs'es would print the right glyphs in the code, but I suppose this is better handled by LaTeX.)

I'd prefer to tackle the layout issue by simply requiring the magic words ('do', 'of', etc.) to always be followed by a line break, but I suppose it *is* possible to have preprocessing software automatically readjust indentation to keep the semantics. In that case, I'd vote for indentation to be a count of actual characters in the code, i.e., #hhhh contributes five to the indentation, but if translated to the character it represents, it contributes one (barring Unicode weirdness, of course).

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
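
As a rough illustration of the points above, the following sketch does two things. It reduces the escape values modulo 256 to reproduce the truncated output quoted from GHC, assuming a compiler whose Char covers all of Unicode so that the escapes themselves keep their full values. It also decodes the proposed #hhhh and ##hhhhhhhh forms. The functions decodeEscapes and hexChar are hypothetical names invented here; nothing like them is part of the report or of any compiler.

```haskell
import Data.Char (chr, isHexDigit, ord)
import Numeric (readHex)

-- The three numeric escapes from the session above, at full width.
escaped :: [Int]
escaped = map ord "\74\749\7490"

-- Reducing modulo 256 reproduces the truncated output: [74,237,66].
truncated :: [Int]
truncated = map (`mod` 256) escaped

-- Hypothetical decoder for the proposed escape forms: "##" plus eight hex
-- digits, or "#" plus four hex digits. Anything else is copied through.
decodeEscapes :: String -> String
decodeEscapes ('#' : '#' : rest)
  | Just (c, rest') <- hexChar 8 rest = c : decodeEscapes rest'
decodeEscapes ('#' : rest)
  | Just (c, rest') <- hexChar 4 rest = c : decodeEscapes rest'
decodeEscapes (c : rest) = c : decodeEscapes rest
decodeEscapes [] = []

-- Read exactly n hex digits as one codepoint, rejecting values above U+10FFFF.
hexChar :: Int -> String -> Maybe (Char, String)
hexChar n s =
  case splitAt n s of
    (digits, rest)
      | length digits == n
      , all isHexDigit digits
      , [(v, "")] <- readHex digits
      , v <= 0x10FFFF -> Just (chr v, rest)
    _ -> Nothing
```

Under the character-counting rule for layout, length "#00E9" is 5 while length (decodeEscapes "#00E9") is 1, which is the difference a preprocessor would have to compensate for when readjusting indentation.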