
Donn Cave wrote:
[... I said earlier ...]
I may be confused here - trying this out, I seem to be getting garbage I don't understand from System.Environment getArgs.
So I returned to this out of curiosity. Specifically, System.Environment getArgs converts common accented characters in ISO-8859-1 command line arguments into values in the 0xDC00 range: lower-case u with umlaut, for example, comes out as 0xDCFC. These values, fed into Data.Text pack and encodeUtf8, seem to be garbage ... I get 3-byte UTF-8 that I highly doubt has anything to do with accented Latin characters, in fact the same "\239\191\189" even for different characters.
But the lower bytes looked like the right Unicode values, and once the 0xDC00 offset is cleared, Data.Text pack and encodeUtf8 work.
I'm no Unicode whiz, so maybe this all makes sense? I'm not inconvenienced by this myself - my interest is only academic, and I'm just wondering what the extra 0xDC00 bits are for. And I should note that, as far as I can make out, this doesn't match the remark at the beginning of this thread: "... does *not* contain the Unicode code points of the characters the user has entered. Instead the input bytes are mapped one-to-one to Char." I have GHC 7.8.3.
Hi Donn,

I am sorry, I should have replied earlier here to say that I was *wrong*: GHC/base does not by default do what I claimed it does, as I learned later and as you confirm now. It does that only if the program expressly asks for it by specifying the so-called "char8" encoding, i.e. by initializing the global variable localeEncoding before the base library does it for you. With this you can override the user's locale as seen by GHC/base. I was working on Darcs at the time, and this is what Darcs does. But I was not aware of this hack, being used to local reasoning in Haskell (doesn't Haskell claim to be a purely functional language?).

Sorry for the confusion. And thanks for confirming that GHC and the base library do the right thing (if we let them).

Cheers
Ben

--
"Make it so they have to reboot after every typo." -- Scott Adams
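
P.S. For concreteness, the override I have in mind looks roughly like the sketch below. It is only a sketch, assuming the setLocaleEncoding/setFileSystemEncoding and char8 pieces of GHC.IO.Encoding in base; it is not Darcs' actual code.

  import GHC.IO.Encoding (char8, setFileSystemEncoding, setLocaleEncoding)
  import System.Environment (getArgs)

  main :: IO ()
  main = do
    -- Force the byte-for-byte "char8" encoding before anything gets
    -- decoded, so every input byte maps one-to-one to a Char regardless
    -- of the user's locale. getArgs decodes with the filesystem
    -- encoding (at least on POSIX), hence both setters.
    setLocaleEncoding char8
    setFileSystemEncoding char8
    args <- getArgs
    -- Print the code points, which should now be the raw byte values.
    mapM_ (print . map fromEnum) args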
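
P.P.S. As for the extra 0xDC00 bits: if I read the base library correctly, that is GHC's "roundtrip" escape for bytes that do not decode under the current locale. Such a byte b comes back from getArgs as the lone surrogate code point 0xDC00 + b, so ISO-8859-1 0xFC (the umlaut u) becomes 0xDCFC, and encoding the string back out with the same encoding restores the original byte. Data.Text cannot represent surrogates, so pack silently replaces them with U+FFFD, whose UTF-8 form is 0xEF 0xBF 0xBD -- exactly the "\239\191\189" you see for every escaped character. A little sketch of that reading (the unescape range below is my guess, not something taken from base):

  import qualified Data.Text as T
  import qualified Data.Text.Encoding as TE
  import Data.Char (chr, ord)
  import System.Environment (getArgs)

  -- Undo the surrogate escape: a byte that failed to decode under the
  -- locale encoding shows up as the code point 0xDC00 + byte, so
  -- subtracting the offset recovers the byte, reinterpreted here as a
  -- Latin-1 code point.
  unescape :: Char -> Char
  unescape c
    | ord c >= 0xDC80 && ord c <= 0xDCFF = chr (ord c - 0xDC00)
    | otherwise                          = c

  main :: IO ()
  main = do
    args <- getArgs
    -- Without the unescape, T.pack turns every lone surrogate into
    -- U+FFFD, which encodeUtf8 renders as "\239\191\189".
    mapM_ (print . TE.encodeUtf8 . T.pack . map unescape) args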