
On Wed, Nov 19, 2014 at 7:56 AM, Donn Cave wrote:
> quoth Donn Cave:
>> ... Umlaut u turns up as 0xFC for UTF-8 users; 0xDCFC for Latin-1 users. This is an ordinary hello world type program, can't think of any unique environmental issues.
> Well, I mischaracterized that problem, so to speak.
>
> I find that GHC is not picking up on my "current locale" encoding, and instead seems to be hard-wired to UTF-8. On MacOS X, I can select an encoding in Terminal Preferences, open a new window, and for all intents and purposes it's an ISO8859-1 world, including LANG=en_US.ISO8859-1, but GHC isn't going along with it.
>
> So the ISO8859-1 umlaut u is undecodable if GHC is stuck in UTF-8, which seems to explain what I'm seeing. If I understand this right, the 0xDC00 high byte is recognized in some circumstances, and the value is spared from UTF-8 encoding and instead simply copied.
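A minimal sketch, using getLocaleEncoding / setLocaleEncoding from base's GHC.IO.Encoding plus the built-in latin1 encoding in System.IO, of checking which encoding ghc actually settled on and forcing Latin-1 by hand (the program itself is just an illustration):

    import System.IO
    import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding)

    main :: IO ()
    main = do
      -- Report the encoding ghc derived from the locale (LANG / LC_*).
      enc <- getLocaleEncoding
      hPutStrLn stderr ("locale encoding: " ++ show enc)
      -- Override it: make Latin-1 the process default and set it on stdout.
      setLocaleEncoding latin1
      hSetEncoding stdout latin1
      putStrLn "u-umlaut: \xFC"   -- now written as the single byte 0xFC

If getLocaleEncoding already reports ISO8859-1 there, the problem lies elsewhere; if it reports UTF-8, ghc really did ignore the locale you set in Terminal.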
ISO8859 is not multibyte. And your earlier description is incorrect, in a way that shows a common confusion about the relationship between Unicode, UTF-8, and ISO8859-1.

U+00FC is the Unicode codepoint for u-umlaut. This is, by design, the same as the single-byte representation of u-umlaut (0xFC) in ISO8859-1. It is *not* the UTF-8 representation of u-umlaut; that is the two-byte sequence 0xC3 0xBC.

The 0xDC prefix is, as I said earlier, a hack used by ghc. Internally it works only in decoded Unicode text; so a byte that it needs to round-trip from the external representation (which per POSIX has no encoding / is an octet string) but cannot decode is escaped on input by adding 0xDC00 to it. That range (U+DC00..U+DFFF) is borrowed from the UTF-16 low surrogates, which can never occur in correctly decoded text, so on output the escape is undone by stripping the 0xDC high byte to recover the original non-UTF-8 external form. But this means that you will find yourself working with a "strange" Unicode codepoint such as U+DCFC.

--
brandon s allbery kf8nh                               sine nomine associates
allbery.b@gmail.com                                  ballbery@sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net
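To make the round-tripping above concrete, a small sketch of the arithmetic; escapeByte and unescapeChar are illustrative names, not ghc internals:

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- An undecodable byte from the outside world is mapped to a lone
    -- low-surrogate code point by adding 0xDC00, so 0xFC becomes '\xDCFC'.
    escapeByte :: Word8 -> Char
    escapeByte b = chr (0xDC00 + fromIntegral b)

    -- On output the escape is undone: code points 0xDC80..0xDCFF are turned
    -- back into the original byte; anything else is ordinary text.
    unescapeChar :: Char -> Maybe Word8
    unescapeChar c
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c

So the 0xDCFC you are seeing is just 0xDC00 + 0xFC, and on output it gets written back out as the single byte 0xFC.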