
Donn Cave wrote:
[... I said earlier ...]
I may be confused here - trying this out, I seem to be getting garbage I don't understand from System.Environment getArgs.
So I returned to this out of curiosity. Specifically, System.Environment getArgs converts common accented characters in ISO-8859-1 command line arguments into values in the 0xDC00 range: lower-case u with umlaut, for example, comes out as 0xDCFC. These values, fed into Data.Text pack and encodeUtf8, seem to be garbage ... I get 3-byte UTF-8 that I highly doubt has anything to do with accented Latin characters, in fact the same "\239\191\189" even for different characters.
But the lower bytes looked like the right Unicode values, and once the 0xDC00 offset is cleared, Data.Text pack and encodeUtf8 work.
I'm no Unicode whiz, so maybe this all makes sense? I'm not inconvenienced by this myself - my interest is only academic, and I'm just wondering what the extra 0xDC00 bits are for. And I should note that, as far as I can make out, this doesn't match the remark at the beginning of this thread: "... does *not* contain the Unicode code points of the characters the user has entered. Instead the input bytes are mapped one-to-one to Char." I have GHC 7.8.3.
Hi Donn,

I am sorry, I should have replied earlier here to say that I was *wrong*: GHC/base does not by default do what I claimed it does, as I learned later and as you confirm now. It does that only if the program expressly asks for it by specifying the so-called "char8" encoding, i.e. by initializing the global variable localeEncoding before the base library does it for you. With this you can override the user's locale as seen by GHC/base. I was working on Darcs at the time, and this is what Darcs does. But I was not aware of this hack, being used to local reasoning in Haskell (doesn't Haskell claim to be a purely functional language?).

Sorry for the confusion. And thanks for confirming that GHC and the base library do the right thing (if we let them).

Cheers
Ben

--
"Make it so they have to reboot after every typo." -- Scott Adams
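
P.S. For concreteness, the override I have in mind looks roughly like the sketch below. It is only a sketch, assuming the setLocaleEncoding/setFileSystemEncoding and char8 pieces of GHC.IO.Encoding in base; it is not Darcs' actual code.

  import GHC.IO.Encoding (char8, setFileSystemEncoding, setLocaleEncoding)
  import System.Environment (getArgs)

  main :: IO ()
  main = do
    -- Force the byte-for-byte "char8" encoding before anything gets
    -- decoded, so every input byte maps one-to-one to a Char regardless
    -- of the user's locale. getArgs decodes with the filesystem
    -- encoding (at least on POSIX), hence both setters.
    setLocaleEncoding char8
    setFileSystemEncoding char8
    args <- getArgs
    -- Print the code points, which should now be the raw byte values.
    mapM_ (print . map fromEnum) args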
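
P.P.S. As for the extra 0xDC00 bits: if I read the base library correctly, that is GHC's "roundtrip" escape for bytes that do not decode under the current locale. Such a byte b comes back from getArgs as the lone surrogate code point 0xDC00 + b, so ISO-8859-1 0xFC (the umlaut u) becomes 0xDCFC, and encoding the string back out with the same encoding restores the original byte. Data.Text cannot represent surrogates, so pack silently replaces them with U+FFFD, whose UTF-8 form is 0xEF 0xBF 0xBD -- exactly the "\239\191\189" you see for every escaped character. A little sketch of that reading (the unescape range below is my guess, not something taken from base):

  import qualified Data.Text as T
  import qualified Data.Text.Encoding as TE
  import Data.Char (chr, ord)
  import System.Environment (getArgs)

  -- Undo the surrogate escape: a byte that failed to decode under the
  -- locale encoding shows up as the code point 0xDC00 + byte, so
  -- subtracting the offset recovers the byte, reinterpreted here as a
  -- Latin-1 code point.
  unescape :: Char -> Char
  unescape c
    | ord c >= 0xDC80 && ord c <= 0xDCFF = chr (ord c - 0xDC00)
    | otherwise                          = c

  main :: IO ()
  main = do
    args <- getArgs
    -- Without the unescape, T.pack turns every lone surrogate into
    -- U+FFFD, which encodeUtf8 renders as "\239\191\189".
    mapM_ (print . TE.encodeUtf8 . T.pack . map unescape) args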