How to reverse GHC encoding of command line arguments

I have a question about how to reverse the text encoding as done by GHC and the base library for data that comes from the command line or the environment.

Assume the user's environment specifies a non-Unicode locale, e.g. some Latin encoding. In this case, the String we get from e.g. System.Environment.getArgs does *not* contain the Unicode code points of the characters the user has entered. Instead, the input bytes are mapped one-to-one to Char. This has probably been done for compatibility reasons, and I do not want to discuss this choice here. Rather, I want to find out how I can convert such a string back to a proper Unicode representation, for instance in order to store the value in a file with a defined encoding such as UTF-8. This should be done in a generic way, i.e. without making ad-hoc assumptions about what the user's encoding might be.

There is the iconv package. However, it takes ByteString as input and output, and it also requires that I give it the encoding as input. How do I find out what this encoding is? On the command line I could simply do

    ben@sarun[1]: ~ > locale charmap
    ISO-8859-1

Is there a Haskell function that does the equivalent, or do I have to use getEnv "LC_CTYPE" and then parse the result?

Let's assume I get this to work, so now I have a String that represents the user's encoding, such as "ISO-8859-1". Now, in order to use iconv, I have to convert the string I got via getArgs into a ByteString. But to do that properly, I would have to decode it according to the user's current locale, which is exactly what I want to achieve in the first place. How do I break this cycle?

Perhaps it is simpler to write our own getArgs/getEnv functions and directly convert the data we get from the system to a proper (Unicode) String?

Any suggestions would be highly appreciated.

Cheers
Ben
--
"Make it so they have to reboot after every typo." -- Scott Adams
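For reference, base itself exposes the encodings it chose at startup, so there is no need to parse LC_CTYPE by hand. A minimal sketch, assuming GHC.IO.Encoding's getLocaleEncoding and getFileSystemEncoding (the latter is the one used for argv and the environment) and the Show instance of TextEncoding, which prints the encoding's name:

    import GHC.IO.Encoding (getFileSystemEncoding, getLocaleEncoding)

    main :: IO ()
    main = do
      -- Encoding chosen from the locale, the analogue of "locale charmap".
      enc <- getLocaleEncoding
      print enc
      -- The variant used to decode command line arguments and environment
      -- variables (see later in this thread).
      fsEnc <- getFileSystemEncoding
      print fsEnc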

If the input bytes are mapped 1-1 to Char values without conversion,
you can just use Data.ByteString.Char8.pack to convert to a
ByteString, which you can then convert to Unicode however you like.
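A minimal sketch of that route; note it is only safe under the stated 1-1 assumption, because Data.ByteString.Char8.pack silently truncates every Char to its low 8 bits:

    import qualified Data.ByteString.Char8 as B8
    import System.Environment (getArgs)

    main :: IO ()
    main = do
      args <- getArgs
      -- Correct only if every Char really is a raw byte (code point < 0x100);
      -- pack truncates larger code points, destroying information.
      let raw = map B8.pack args
      print raw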

Carl Howells wrote:
If the input bytes are mapped 1-1 to Char values without conversion, you can just use Data.ByteString.Char8.pack to convert to a ByteString, which you can then convert to Unicode however you like.
Yes, but I cannot be sure this is the case; it depends on the user's locale encoding.

Cheers
Ben
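One way to probe that assumption: a String produced by a 1-1 byte mapping can never contain a code point above 0xFF, so anything larger (such as the 0xDCxx escape values discussed later in this thread) means real decoding took place. A small sketch; the helper name is made up for illustration:

    -- Hypothetical check: could this String have come from a 1-1
    -- byte-to-Char mapping? (All code points fit in a single byte.)
    isByteMapped :: String -> Bool
    isByteMapped = all ((< 0x100) . fromEnum)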

On Sun, Nov 16, 2014 at 8:42 AM, Ben Franksen wrote:
How do I break this cycle?
Perhaps it is simpler to write our own getArgs/getEnv functions and directly convert the data we get from the system to a proper (Unicode) String?
Ideally there should be a System.Posix.Environment.getArgs that just returns the raw POSIX string (possibly as a ByteString); as with most of POSIX, there is no defined encoding for this: it's octets. If you insist on imposing an encoding on it, you could start from that.

quoth Ben Franksen
Perhaps it is simpler to write our own getArgs/getEnv functions and directly convert the data we get from the system to a proper (Unicode) String?
I may be confused here - trying this out, I seem to be getting garbage I don't understand from System.Environment getArgs. But there's a System.Posix.Env.ByteString getArgs that looks like just what you propose above.

    import qualified Data.ByteString.Char8 as P
    import qualified System.Posix.Env.ByteString as B
    import qualified Data.Text as T
    import Data.Text.Encoding (encodeUtf8)

    main :: IO ()
    main = do
      -- Raw argv octets, no decoding applied.
      argsb <- B.getArgs
      putStrLn ("byte args: " ++ show argsb)
      -- Treat each byte as a Latin-1 code point, then encode as UTF-8.
      let argsu = map (encodeUtf8 . T.pack . P.unpack) argsb
      putStrLn ("UTF8 byte args: " ++ show argsu)

    $ ./cvtargs [string that perhaps should not be in this email!]
    byte args: ["S\228kkij\228rven","Polkka"]
    UTF8 byte args: ["S\195\164kkij\195\164rven","Polkka"]

Donn

Donn Cave wrote:
But there's a System.Posix.Env.ByteString getArgs, that looks like just what you propose above. [...]
Cool, I wasn't aware that System.Posix had that function. Now I need to see what to do for Windows... Anyway, many thanks to you and everyone else who offered suggestions.

Cheers
Ben

[... I said earlier ...]
I may be confused here - trying this out, I seem to be getting garbage I don't understand from System.Environment getArgs.
So I returned to this out of curiosity. Specifically, System.Environment getArgs converts common accented characters in ISO-8859-1 command line arguments into values in the high 0xDC00's. Lower case umlaut u, for example, is 0xDCFC.

These values, fed into Data.Text pack and encodeUtf8, seem to be garbage ... I get 3-byte UTF-8 that I highly doubt has anything to do with accented Latin characters; in fact I get the same "\239\191\189" (the UTF-8 form of the U+FFFD replacement character) even for different chars. But the lower bytes looked like Unicode values, and if the upper 0xDC00 is cleared, Data.Text pack and encodeUtf8 work.

I'm no Unicode whiz, so maybe this all makes sense? I'm not inconvenienced by this myself, my interest is only academic; I'm just wondering what the extra 0xDC00 bits are for. And I should note that, as far as I can make out, this doesn't match the remark at the beginning of this thread: "... does *not* contain the Unicode code points of the characters the user has entered. Instead the input bytes are mapped one-to-one to Char." I have GHC 7.8.3.

thanks,
Donn
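For the record, here is a minimal sketch of the "clear the upper 0xDC00" experiment described above, assuming (as explained later in this thread) that undecodable bytes 0x80-0xFF come back from getArgs as the escape code points U+DC80-U+DCFF:

    import Data.Char (chr, ord)

    -- Map an escape code point U+DC80..U+DCFF back to the raw byte it
    -- stands for; leave every other character untouched.
    unescape :: Char -> Char
    unescape c
      | ord c >= 0xDC80 && ord c <= 0xDCFF = chr (ord c - 0xDC00)
      | otherwise                          = c

    -- e.g. map unescape "S\xDCE4kkij\xDCE4rven" == "S\xE4kkij\xE4rven"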

Donn Cave wrote:
System.Environment getArgs converts common accented characters in ISO-8859-1 command line arguments, into values in the high 0xDC00's. Lower case umlaut u, for example, is 0xDCFC. [...]
Hi Donn,

I am sorry, I should have replied earlier here to say that I was *wrong*: GHC/base does not by default do what I claimed it does, as I learned later and as you confirm now. It does that only if the program expressly demands it by specifying a so-called "char8" encoding, i.e. by initializing the global variable localeEncoding before the base library does it for you. With this you can override the user's locale as seen by GHC/base. I was working on Darcs, and this is what Darcs does. But I was not aware of this hack, being used to local reasoning in Haskell (doesn't Haskell claim to be a purely functional language?).

Sorry for the confusion. And thanks for confirming that GHC and the base library do the right thing (if we let them).

Cheers
Ben

Ben Franksen wrote:
GHC/base does not by default do what I claimed it does [...] It does that only if the program expressly demands it by specifying a so-called "char8" encoding [...]
I should perhaps add that I was also misled by the documentation in the base library, which in one place says that setLocaleEncoding does not influence the value you get from getFileSystemEncoding (which is used to decode command line arguments and environment variables). This is true once the base library has initialized the variable, but since the initialization is lazy, as with all globals in Haskell, setLocaleEncoding does have an effect if you call it early enough. Perhaps this would be a worthwhile addition to the docs.

Cheers
Ben
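A minimal sketch of the hack described above, assuming setLocaleEncoding, setFileSystemEncoding and char8 from GHC.IO.Encoding in base; the calls must run before anything forces the lazily-initialized encoding globals:

    import GHC.IO.Encoding (char8, setFileSystemEncoding, setLocaleEncoding)
    import System.Environment (getArgs)

    main :: IO ()
    main = do
      -- Run these first: once base has initialized its encoding globals,
      -- setting them no longer affects values that were already decoded.
      setLocaleEncoding char8
      setFileSystemEncoding char8
      -- With char8 in effect, argv bytes are mapped one-to-one to Chars
      -- (the Darcs behaviour described above).
      args <- getArgs
      print args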

quoth Ben Franksen
Sorry for the confusion. And thanks for confirming that GHC and the base library do the right thing (if we let them).
Hm, that's my question -- how is this the right thing? Umlaut u turns up as 0xFC for UTF-8 users; 0xDCFC, for Latin-1 users. This is an ordinary hello-world type program; I can't think of any unique environmental issues.

- So should we routinely run argv through a high-byte stripper?
- Should I learn to appreciate the high 0xDC00 byte, because it serves some purpose I wasn't aware of?
- Am I somehow messing myself up, and this doesn't normally happen?
- Or is the base library really not quite right here?

Just curious, mind you!

Donn

On Tue, Nov 18, 2014 at 2:30 AM, Donn Cave wrote:
quoth Ben Franksen
... Sorry for the confusion. And thanks for confirming that GHC and the base library do the right thing (if we let them).
Hm, that's my question -- how is this the right thing?
This sounds like a fossil. The first version of trying to support locales/encodings on POSIX did that to anything with the 8th bit set, IIRC, rather than make a possibly incorrect guess as to the intended locale (since POSIX does not support locales here; the argument vector is a list of octet strings). You could undo it and apply an encoding yourself.

I recall there being a "lively" discussion of it back in the day, but not which list it was on (it may have been -cafe or libraries).

quoth Donn Cave
Umlaut u turns up as 0xFC for UTF-8 users; 0xDCFC, for Latin-1 users. This is an ordinary hello world type program, can't think of any unique environmental issues.
Well, I mischaracterized that problem, so to speak. I find that GHC is not picking up on my "current locale" encoding, and instead seems to be hard-wired to UTF-8. On MacOS X, I can select an encoding in Terminal Preferences, open a new window, and for all intents and purposes it's an ISO8859-1 world, including LANG=en_US.ISO8859-1, but GHC isn't going along with it.

So the ISO8859-1 umlaut u is undecodable if GHC is stuck in UTF-8, which seems to explain what I'm seeing. If I understand this right, the 0xDC00 high byte is recognized in some circumstances, and the value is spared from UTF-8 encoding and instead simply copied.

Hope that was interesting!

Donn

On Wed, Nov 19, 2014 at 7:56 AM, Donn Cave wrote:
quoth Donn Cave
... Umlaut u turns up as 0xFC for UTF-8 users; 0xDCFC, for Latin-1 users. This is an ordinary hello world type program, can't think of any unique environmental issues.
Well, I mischaracterized that problem, so to speak.
I find that GHC is not picking up on my "current locale" encoding, and instead seems to be hard-wired to UTF-8. [...] If I understand this right, the 0xDC00 high byte is recognized in some circumstances, and the value is spared from UTF-8 encoding and instead simply copied.
ISO8859 is not multibyte. And your earlier description is incorrect, in a way that shows a common confusion about the relationship between Unicode, UTF-8, and ISO8859-1.

U+00FC is the Unicode code point for u-umlaut. This is, by design, the same as the single-byte representation of u-umlaut (0xFC) in ISO8859-1. It is *not* the UTF-8 representation of u-umlaut; that is 0xC3 0xBC.

The 0xDC prefix is, as I said earlier, a hack used by GHC. Internally it only uses UTF-8; so a non-UTF-8 value that it needs to roundtrip from its external representation (which per POSIX has no encoding / is an octet string) to its internal representation is encoded as if it were UTF-8, with a 0xDC prefix (stolen from the low surrogate range, which never appears in well-formed Unicode text), and then decoded back to the non-UTF-8 external form by stripping the prefix. But this means that you will find yourself working with "strange" Unicode code points.
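Those byte values are easy to check; a quick sketch using Data.Text.Encoding:

    import qualified Data.Text as T
    import Data.Text.Encoding (encodeUtf8)

    -- u-umlaut is U+00FC, which by design is also its single ISO8859-1
    -- byte (0xFC), but its UTF-8 form is the two bytes 0xC3 0xBC:
    main :: IO ()
    main = print (encodeUtf8 (T.singleton '\xFC'))  -- prints "\195\188"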
participants (4):

- Ben Franksen
- Brandon Allbery
- Carl Howells
- Donn Cave