Has character changed in GHC 6.8?

I vaguely remember that in GHC 6.6 code like this

length $ map ord "a string"

being able to generate a different answer than

length "a string"

At the time I thought that the encoding (in my case UTF-8) was “leaking through”. After switching to GHC 6.8 the behaviour seems to have changed, and mapping 'ord' on a string results in a list of ints representing the Unicode code point rather than the encoding:
map ord "åäö"
[229,228,246]
Is this the case, or is there something strange going on with character encodings? I was hoping that this would mean that 'chr . ord' would basically be a no-op, but no such luck:
chr . ord $ 'å'
'\229'
What would I have to do to get an 'å' from '229'?

/M

--
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org
Jabber: magnus.therning@gmail.com
http://therning.org/magnus

What if I don't want to obey the laws? Do they throw me in jail with the other bad monads?
-- Daveman

2008/1/22 Magnus Therning
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
I guess it's not very difficult to prove that
∀ f xs. length xs == length (map f xs)
even in the presence of seq.

-- Felipe.

On Tue, 2008-01-22 at 07:45 -0200, Felipe Lessa wrote:
2008/1/22 Magnus Therning
: I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
I guess it's not very difficult to prove that
∀ f xs. length xs == length (map f xs)
even in the presence of seq.
This is the free theorem of length. For it to be wrong, parametric polymorphism would have to be incorrectly implemented. Even seq makes no difference (in this case.)
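For readers who want to poke at this, here is a minimal sketch that tests (rather than proves) one instance of the theorem; it assumes only the QuickCheck package:

import Test.QuickCheck (quickCheck)
import Data.Char (ord)

-- One instance of the free theorem: mapping a function over a list
-- cannot change its length.
prop_lengthMap :: String -> Bool
prop_lengthMap xs = length xs == length (map ord xs)

main :: IO ()
main = quickCheck prop_lengthMap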

On Tue, 2008-01-22 at 09:29 +0000, Magnus Therning wrote:
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
That seems unlikely.
At the time I thought that the encoding (in my case UTF-8) was “leaking through”. After switching to GHC 6.8 the behaviour seems to have changed, and mapping 'ord' on a string results in a list of ints representing the Unicode code point rather than the encoding:
Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them as Latin-1.
map ord "åäö"
[229,228,246]
Is this the case, or is there something strange going on with character encodings?
That's what we'd expect. Note that GHCi still uses Latin-1. This will change in GHC-6.10.
I was hoping that this would mean that 'chr . ord' would basically be a no-op, but no such luck:
chr . ord $ 'å'
'\229'
What would I have to do to get an 'å' from '229'?
Easy!

Prelude> 'å' == '\229'
True
Prelude> 'å' == Char.chr 229
True

Remember, when you type:

Prelude> 'å'

what you really get is:

Prelude> putStrLn (show 'å')

So perhaps what is confusing you is the Show instance for Char which converts Char -> String into a portable ascii representation.

Duncan
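A small self-contained illustration of the Show point; whether the last line actually displays an å depends on the terminal and compiler version, so treat the comments as assumptions:

import Data.Char (chr)

main :: IO ()
main = do
  print (chr 229 == 'å')     -- True: both are the same code point
  putStrLn (show (chr 229))  -- prints '\229', the escaped Show form
  putStrLn [chr 229]         -- prints the character itself, if the terminal copes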

On Tue, 22 Jan 2008, Duncan Coutts wrote:
At the time I thought that the encoding (in my case UTF-8) was “leaking through”. After switching to GHC 6.8 the behaviour seems to have changed, and mapping 'ord' on a string results in a list of ints representing the Unicode code point rather than the encoding:
Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them as Latin-1.
Can this be controlled by an option?

On Tue, 2008-01-22 at 13:48 +0100, Henning Thielemann wrote:
On Tue, 22 Jan 2008, Duncan Coutts wrote:
At the time I thought that the encoding (in my case UTF-8) was “leaking through”. After switching to GHC 6.8 the behaviour seems to have changed, and mapping 'ord' on a string results in a list of ints representing the Unicode code point rather than the encoding:
Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them as Latin-1.
Can this be controlled by an option?
From the GHC manual:
GHC assumes that source files are ASCII or UTF-8 only, other encodings are not recognised. However, invalid UTF-8 sequences will be ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only.

There is no option to have GHC assume a different encoding. You can use something like iconv to convert .hs files from another encoding into UTF-8.

Duncan

Hello Duncan, Tuesday, January 22, 2008, 1:36:44 PM, you wrote:
Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them as Latin-1.
afair, it was changed since 6.6 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 1/22/08, Duncan Coutts
On Tue, 2008-01-22 at 09:29 +0000, Magnus Therning wrote:
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
That seems unlikely.
Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version currently in Debian Sid):
map ord "a" [97] map ord "ö" [195,182]
Funky, isn't it? ;-)

Easy!
Prelude> 'å' == '\229'
True
Prelude> 'å' == Char.chr 229
True
Remember, when you type:
Prelude> 'å'
what you really get is:
Prelude> putStrLn (show 'å')
So perhaps what is confusing you is the Show instance for Char which converts Char -> String into a portable ascii representation.
Have you tried putting any of this into GHCi (6.6.1)? Any line with 'å' results in the following for me:
'å'
<interactive>:1:2: lexical error in string/character literal at character '\165'
"å"
"\195\165"
Somewhat disappointing. GHCi 6.8.2 does perform better though. /M

On Tue, Jan 22, 2008 at 03:16:15PM +0000, Magnus Therning wrote:
On 1/22/08, Duncan Coutts wrote:
On Tue, 2008-01-22 at 09:29 +0000, Magnus Therning wrote:
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
That seems unlikely.
Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version currently in Debian Sid):
map ord "a" [97] map ord "ö" [195,182]
In 6.6.1:

Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2

there are actually 2 bytes there, but your terminal is showing them as one character.

Thanks
Ian

Ian Lynagh wrote:
Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2
there are actually 2 bytes there, but your terminal is showing them as one character.
So let's all switch to unicode ASAP and leave that horrible multi-byte-string-thing behind us?

Cheers,
Peter

Peter Verswyvelen
Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2
there are actually 2 bytes there, but your terminal is showing them as one character.
So let's all switch to unicode ASAP and leave that horrible multi-byte-string-thing behind us?
You are being ironic, I take it? Unicode by its nature implies multi-byte chars, it's just a question of how they are encoded: UTF-8 (one or more bytes, variable), UTF-16 (two or four, variable), or UCS-4 (or should it be UTF-32? - four bytes, fixed). The problem here is that while terminal software has been UTF-8 for some time, GHC only recently caught up.

-k
-- If I haven't seen further, it is by standing in the footprints of giants
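To make those widths concrete, here is a small sketch, not tied to any particular library, that computes how many bytes UTF-8 needs for a single Char:

import Data.Char (ord)

-- Number of bytes UTF-8 uses for one code point.
utf8Width :: Char -> Int
utf8Width c
  | n <= 0x7F   = 1   -- plain ASCII
  | n <= 0x7FF  = 2   -- e.g. 'å' and 'ö'
  | n <= 0xFFFF = 3
  | otherwise   = 4
  where n = ord c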

Ketil Malde wrote:
So let's all switch to unicode ASAP and leave that horrible multi-byte-string-thing behind us?
You are being ironic, I take it?
No, I just used the wrong terminology. When I said unicode, I actually meant UCS-x, and with multi-byte-string-thing I meant VARIABLE-length, sorry about that. I find variable length chars so much harder to use and reason about than fixed length characters. UTF-x is a form of compression, which is understandable, but it is IMHO a burden (since it does not allow random access to the n-th character).

Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2 use for [Char]? UCS-32?

BTW: According to Wikipedia, UCS-4 and UTF-32 are functionally equivalent.

Cheers,
Peter

Peter Verswyvelen wrote:
Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2 use for [Char]? UCS-32?
How dare you! Such a personal question! This is none of your business.

I jest, but the point is sound: the internal storage of Char is ghc's business, and it should not leak to the programmer. All the programmer needs to know is that Char is capable of storing unicode characters. GHC might choose some custom storage method, including making Char an ADT behind the scenes, or whatever it likes. Other haskell compilers or interpreters are free to choose their own representation. In practice, I believe that for GHC it's a wchar, which is typically a 32bit character with reasonably efficient libc support.

What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.

Jules
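A sketch of that truncation, under the assumption that the putStr of this era simply kept the low byte of each code point (toLowByte is an invented name, not an actual library function):

import Data.Char (ord)
import Data.Word (Word8)

-- Model of "lower 8 bits of unicode code point": anything above 0xFF
-- silently wraps around.
toLowByte :: Char -> Word8
toLowByte = fromIntegral . ord

-- toLowByte 'å' == 229, but toLowByte 'ā' (code point 257) == 1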

On Jan 23, 2008 11:56 AM, Jules Bean
Peter Verswyvelen wrote:
Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2 use for [Char]? UCS-32?
[snip]
What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.
Which is terrible! You should have to be explicit about what encoding you expect. Python 3000 does it right. -- Johan

Johan Tibell wrote:
On Jan 23, 2008 11:56 AM, Jules Bean wrote:
Peter Verswyvelen wrote:
Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2 use for [Char]? UCS-32? [snip]
What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.
Which is terrible! You should have to be explicit about what encoding you expect. Python 3000 does it right.
No arguments there. Presumably there wasn't a sufficiently good answer available in time for haskell98. Jules

What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.
Which is terrible! You should have to be explicit about what encoding you expect. Python 3000 does it right.
Presumably there wasn't a sufficiently good answer available in time for haskell98.
Will there be one for haskell prime?
The I/O library needs an overhaul but I'm not sure how to do this in a backwards compatible manner which probably would be required for inclusion in Haskell'. One could, like Python 3000, break backwards compatibility. I'm not sure about the implications of doing this. Maybe introducing a new System.IO.Unicode module would be an option.

If one wants to keep the interface but change the semantics slightly one could define e.g. getChar as:

getChar :: IO Char
getChar = getWord8 >>= decodeChar latin1

Assuming latin-1 is what's used now. The benefit would be that if the input is not in latin-1 an exception could be thrown rather than returning a Char representing the wrong Unicode code point.

I recommend reading about the Python I/O system overhaul for Python 3000 which is outlined in PEP 3116:
http://www.python.org/dev/peps/pep-3116/

My proposal is for I/O functions to specify the encoding they use if they accept or return Chars (and Strings). If they deal in terms of bytes (e.g. socket functions) they should accept and return Word8s. Optionally, text I/O functions could default to the system locale setting.

-- Johan
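(getWord8, decodeChar and latin1 above are hypothetical names.) As a self-contained sketch of what a Latin-1 decoder/encoder pair over plain byte lists could look like:

import Data.Char (chr, ord)
import Data.Word (Word8)

-- Latin-1 maps bytes directly onto the first 256 code points, so
-- decoding cannot fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encoding can fail, which is where an exception (or here, Maybe) comes in.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing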

Johan Tibell wrote:
What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.
Which is terrible! You should have to be explicit about what encoding you expect. Python 3000 does it right.
Presumably there wasn't a sufficiently good answer available in time for haskell98.
Will there be one for haskell prime?
The I/O library needs an overhaul but I'm not sure how to do this in a backwards compatible manner which probably would be required for inclusion in Haskell'. One could, like Python 3000, break backwards compatibility. I'm not sure about the implications of doing this. Maybe introducing a new System.IO.Unicode module would be an option.
If one wants to keep the interface but change the semantics slightly one could define e.g. getChar as:
getChar :: IO Char
getChar = getWord8 >>= decodeChar latin1
Assuming latin-1 is what's used now.
The benefit would be that if the input is not in latin-1 an exception could be thrown rather than returning a Char representing the wrong Unicode code point.
I'm not sure what you mean here. All 256 possible values have a meaning. I did say 'lower 8 bits of unicode code point which is almost functionally equivalent to latin-1.' IIUC, it's latin-1 plus the two control-character ranges. There are no decoding errors for haskell98's getChar.
My proposal is for I/O functions to specify the encoding they use if they accept or return Chars (and Strings). If they deal in terms of bytes (e.g. socket functions) they should accept and return Word8s.
I would be more inclined to suggest they default to a particular well understood encoding, almost certainly UTF8. Another interface could give access to other encodings.
Optionally, text I/O functions could default to the system locale setting.
That is a disastrous idea. Please read the other flamewars^Wdiscussions on this list about this subject :) One was started by a certain Johann Tibell :)

http://haskell.org/pipermail/haskell-cafe/2007-September/031724.html
http://haskell.org/pipermail/haskell-cafe/2007-September/032195.html

Jules

The benefit would be that if the input is not in latin-1 an exception could be thrown rather than returning a Char representing the wrong Unicode code point.
I'm not sure what you mean here. All 256 possible values have a meaning.
You're of course right. So we don't have a problem here. Maybe I was thinking of an encoding (7-bit ASCII?) where some of the 256 values are invalid.
My proposal is for I/O functions to specify the encoding they use if they accept or return Chars (and Strings). If they deal in terms of bytes (e.g. socket functions) they should accept and return Word8s.
I would be more inclined to suggest they default to a particular well understand encoding, almost certainly UTF8. Another interface could give access to other encodings.
That might be a good option. However, it would be nice if beginners could write simple console programs using System.IO and have them work correctly even if their system's encoding is not byte compatible with UTF-8. People who do I/O over the network etc. need to be more careful and should specify the encoding used. How would a UTF-8 default work on different Windows versions?
Optionally, text I/O functions could default to the system locale setting.
That is a disastrous idea.
I'm not sure about that as long as decode is called on the input to make sure that it's a valid encoding given the input bytes. Same point as above.

What I would like to avoid is having to write:

main = do
  putStrLn systemLocalEncoding "What's your name?"
  name <- getLine systemLocalEncoding
  putStrLn systemLocalEncoding $ "Hi " ++ name ++ "!"

I guess we could solve this by putting the functions in different modules:

System.IO -- requires explicit encoding
System.IO.DefaultEncoding -- implicit use of system locale setting

And have the modules export the same functions. Another option would be to include the fact that encoding is implied in the name of the function.

Maybe we should start by giving some type signatures and function names. That often helps my thinking. I'll try to write something down when I get home from work.

-- Johan
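In that spirit, a purely hypothetical set of signatures for the explicit-encoding flavour; none of these names exist in any real library, and the bodies are left undefined because only the shape of the API is at issue:

import Data.Word (Word8)

-- Whatever set of encodings ends up being supported.
data Encoding = Ascii | Latin1 | Utf8

-- Text I/O always says which encoding it uses.
getLineWith :: Encoding -> IO String
getLineWith = undefined

putStrLnWith :: Encoding -> String -> IO ()
putStrLnWith = undefined

-- Byte-oriented I/O never touches Char at all.
getBytes :: Int -> IO [Word8]
getBytes = undefined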

"Johan Tibell"
The benefit would be that if the input is not in latin-1 an exception could be thrown rather than returning a Char representing the wrong Unicode code point.
I'm not sure what you mean here. All 256 possible values have a meaning.
OTOH, going the other way could be more troublesome, I'm not sure that outputting a truncated value is what you want.
You're of course right. So we don't have a problem here. Maybe I was thinking of an encoding (7-bit ASCII?) where some of the 256 values are invalid.
Well - each byte can be converted to the equivalent code point, but 0x80-0x9F are control characters, and some of those are left undefined. Perhaps instead of truncating on output, we should map code points > 0xFF to such a value? E.g. 0x81 is undefined in both Unicode and Windows 1252.

-k
-- If I haven't seen further, it is by standing in the footprints of giants
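A tiny sketch of that suggestion; 0x81 as the sentinel is Ketil's example, and the function name is made up:

import Data.Char (ord)
import Data.Word (Word8)

-- Instead of silently truncating, map anything outside 0x00-0xFF to a
-- byte that ordinary Latin-1/Windows-1252 text should never contain.
toByteOrSentinel :: Char -> Word8
toByteOrSentinel c
  | ord c <= 0xFF = fromIntegral (ord c)
  | otherwise     = 0x81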

On 1/23/08, Johan Tibell
they accept or return Chars (and Strings). If they deal in terms of bytes (e.g. socket functions) they should accept and return Word8s. Optionally, text I/O functions could default to the system locale setting.
Yes, this reflects my recent experience, Char is not a good representation for an 8-bit byte. This thread came out of my attempt to add a module to dataenc[1] that would make base64-string[2] obsolete. As you probably can guess I came to the conclusion that a function for data encoding with type 'String -> String' is plain wrong. :-)

/M

[1]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dataenc-0.10.2
[2]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/base64-string-0.1

On Jan 23, 2008 2:11 PM, Magnus Therning
Yes, this reflects my recent experience, Char is not a good representation for an 8-bit byte. This thread came out of my attempt to add a module to dataenc[1] that would make base64-string[2] obsolete. As you probably can guess I came to the conclusion that a function for data encoding with type 'String -> String' is plain wrong. :-)
Yes. Functions that deal with bytes shouldn't use Char. Char should be seen as an ADT representing Unicode code points. It has nothing to do with bytes.

-- Johan

Johan Tibell wrote:
What *does* matter to the programmer is what encodings putStr and getLine use. AFAIK, they use "lower 8 bits of unicode code point" which is almost functionally equivalent to latin-1.
Which is terrible! You should have to be explicit about what encoding you expect. Python 3000 does it right.
Presumably there wasn't a sufficiently good answer available in time for haskell98.
Will there be one for haskell prime ?
The I/O library needs an overhaul but I'm not sure how to do this in a backwards compatible manner which probably would be required for inclusion in Haskell'. One could, like Python 3000, break backwards compatibility. I'm not sure about the implications of doing this. Maybe introducing a new System.IO.Unicode module would be an option.

There are already some libraries that attempt to create a new string and I/O library for Haskell, based on Unicode, with a separation of byte semantics and character semantics. See for example Streams [1] or CompactString [2].
Regards,
Reinier

[1]: http://haskell.org/haskellwiki/Library/Streams
[2]: http://twan.home.fmf.nl/compact-string/

Peter Verswyvelen
No I just used wrong terminology. When I said unicode, I actually meant UCS-x,
You might as well say UCS-4, nobody uses UCS-2 anymore. It's been replaced by UTF-16, which gives you the complexity of UTF-8 without being compact (for 99% of existing data), endianness-indifferent, or backwards compatible with ASCII.
and with multi-byte-string-thing I meant VARIABLE-length, sorry about that. I find variable length chars so much harder to use and reason about than the fixed length characters. UTF-x is a form of compression, which is understandable, but it is IMHO a burden (since it does not allow random access to the n-th character)
Do you really need that, though? Most formats I know with enough structure that you can pick up records by offset either encode the offsets somewhere, or are restricted to ASCII, or both.
Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2 use for [Char]? UCS-32?
Internally, Haskell Chars are Unicode, and each stores a code point as a 32-bit (well, actually 21 bit or something) value. One Char, one code point. ByteString stores 8-bit "char"s, and the Char8 interface chops off the top bits, essentially projecting code points down to the ISO-8859-1 (latin1) subset.

Externally, it depends on what IO library you use. As for the command line, Ian's post links to:
http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

-k
-- If I haven't seen further, it is by standing in the footprints of giants
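The chopping is easy to observe with the bytestring package's Data.ByteString.Char8 module, whose pack is documented to truncate each Char to 8 bits; the outputs in the comments are what that truncation implies:

import qualified Data.ByteString.Char8 as B8
import Data.Char (ord)

main :: IO ()
main = do
  -- 'å' (code point 229) fits in a byte and survives the round trip:
  print (map ord (B8.unpack (B8.pack "å")))   -- [229]
  -- 'ɓ' (code point 595) does not; only 595 mod 256 = 83 ('S') comes back:
  print (map ord (B8.unpack (B8.pack "ɓ")))   -- [83]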

Ian Lynagh wrote:
On Tue, Jan 22, 2008 at 03:16:15PM +0000, Magnus Therning wrote:
On 1/22/08, Duncan Coutts wrote:
On Tue, 2008-01-22 at 09:29 +0000, Magnus Therning wrote:
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
That seems unlikely.
Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version currently in Debian Sid):
map ord "a"
[97]
map ord "ö"
[195,182]
In 6.6.1:
Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2
there are actually 2 bytes there, but your terminal is showing them as one character.

Still, that seems weird to me. A Haskell Char is a Unicode character. An "ö" is either one character (unicode point 0xF6) (which, in UTF-8, is coded as two bytes) or a combination of an "o" with an umlaut (Unicode point 776). But because the last character is not 776, the "ö" here should just be one character. I'd suspect that the two-character string comes from the terminal speaking UTF-8 to GHC expecting Latin-1. GHC 6.8 expects UTF-8, so all is fine.

On my MacBook (OS X 10.4), 'ö' also immediately expands to "\303\266" when I type it in my terminal, even outside GHCi. That suggests that the terminal program doesn't handle Unicode and immediately escapes weird characters.

Regards,
Reinier

On 1/22/08, Ian Lynagh
On Tue, Jan 22, 2008 at 03:16:15PM +0000, Magnus Therning wrote:
On 1/22/08, Duncan Coutts wrote:
On Tue, 2008-01-22 at 09:29 +0000, Magnus Therning wrote:
I vaguely remember that in GHC 6.6 code like this
length $ map ord "a string"
being able to generate a different answer than
length "a string"
That seems unlikely.
Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version currently in Debian Sid):
map ord "a" [97] map ord "ö" [195,182]
In 6.6.1:
Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2
there are actually 2 bytes there, but your terminal is showing them as one character.
Yes, of course, stupid me. But it is still the UTF-8 representation of "ö", not Latin-1, and this brings me back to my original question, is this an intentional change in 6.8?
map ord "ö" [246] map ord "åɓz𝐀" [229,595,65370,119808]
6.8 produces Unicode code points rather than a particular encoding.

/M

Magnus Therning wrote:
Yes, of course, stupid me. But it is still the UTF-8 representation of "ö", not Latin-1, and this brings me back to my original question, is this an intentional change in 6.8?
map ord "ö" [246] map ord "åɓz𝐀" [229,595,65370,119808]
6.8 produces Unicode code points rather than a particular encoding.
The key point here is this has nothing to do with GHC. GHC's behaviour has not changed in this regard. This is about GHCi! [And, to some extent, the behaviour of whatever shell / terminal emulator you run ghci in.]

Sounds like a pedantic difference, but it's not. The difference here is what GHCi is feeding into your haskell code when you type the sequence "ö" at a ghci prompt, rather than anything different about the underlying behaviour of map, ord, length, show, putStr. map, ord, length, show, putStr have not changed from 6.6 to 6.8.

I don't have 6.8 handy myself but from your demonstration it would appear that 6.8's ghci correctly understands whatever input encoding is being used in whatever terminal environment you are choosing to run ghci within. Whereas 6.6's ghci was using a single-byte terminal approach, and your terminal environment was encoding ö as two characters.

Jules

On Tue, Jan 22, 2008 at 03:59:24PM +0000, Magnus Therning wrote:
Yes, of course, stupid me. But it is still the UTF-8 representation of "ö", not Latin-1, and this brings me back to my original question, is this an intentional change in 6.8?
Yes (in 6.8.2, to be precise). It's in the release notes:
http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

GHCi now treats all input as unicode, except for the Windows console where we do the correct conversion from the current code page.

Thanks
Ian

On 1/22/08, Ian Lynagh
On Tue, Jan 22, 2008 at 03:59:24PM +0000, Magnus Therning wrote:
Yes, of course, stupid me. But it is still the UTF-8 representation of
"ö",
not Latin-1, and this brings me back to my original question, is this an intentional change in 6.8?
Yes (in 6.8.2, to be precise).
It's in the release notes:
http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

GHCi now treats all input as unicode, except for the Windows console where we do the correct conversion from the current code page.
Excellent news. One step closer to sanity when it comes to character encodings on the command line :-) /M
participants (13)

- Bulat Ziganshin
- david48
- Derek Elkins
- Duncan Coutts
- Felipe Lessa
- Henning Thielemann
- Ian Lynagh
- Johan Tibell
- Jules Bean
- Ketil Malde
- Magnus Therning
- Peter Verswyvelen
- Reinier Lamers