
[... I said earlier ...]
I may be confused here; trying this out, I seem to be getting garbage I don't understand from System.Environment's getArgs.
So I returned to this out of curiosity. Specifically, System.Environment's getArgs converts common accented characters in ISO-8859-1 command line arguments into values in the high 0xDC00 range; lower-case u with umlaut, for example, comes out as 0xDCFC. Fed into Data.Text's pack and encodeUtf8, these values seem to produce garbage: I get three-byte UTF-8 that I highly doubt has anything to do with accented Latin characters, in fact the same "\239\191\189" even for different characters. But the low bytes looked like Latin-1 values, and if the upper 0xDC00 is cleared, pack and encodeUtf8 work as expected.

I'm no Unicode whiz, so maybe this all makes sense? I'm not inconvenienced by this myself, my interest is only academic; I'm just wondering what the extra 0xDC00 bits are for.

I should also note that, as far as I can make out, this doesn't match the remark at the beginning of this thread: "... does *not* contain the Unicode code points of the characters the user has entered. Instead the input bytes are mapped one-to-one to Char."

I have GHC 7.8.3.

thanks, Donn
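In case it's useful, here is the "clear the upper 0xDC00" workaround I described, as a small sketch. It assumes (and I can't confirm this) that values in the 0xDC00-0xDCFF range carry the original argument byte in their low eight bits; the `unescape` helper name is my own invention:

```haskell
import           Data.Char             (chr, ord)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as TE
import           System.Environment    (getArgs)

-- Assumption (unconfirmed): a Char in 0xDC00..0xDCFF holds the
-- original input byte in its low eight bits, so subtracting
-- 0xDC00 recovers the Latin-1 byte value.
unescape :: Char -> Char
unescape c
  | o >= 0xDC00 && o <= 0xDCFF = chr (o - 0xDC00)
  | otherwise                  = c
  where o = ord c

main :: IO ()
main = do
  args <- getArgs
  -- With the 0xDC00 offset cleared, pack/encodeUtf8 behave normally
  -- and each argument round-trips to sensible UTF-8 bytes.
  mapM_ (BC.putStrLn . TE.encodeUtf8 . T.pack . map unescape) args
```

Without the `unescape` pass, each escaped character ends up as the same three bytes "\239\191\189" in the encodeUtf8 output, which is what I was seeing.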