
On 17/Oct/10 3:37 PM, Michael Snoyman wrote:
On Sun, Oct 17, 2010 at 2:26 PM, Ionut G. Stan
wrote: Thanks Michael, now it works indeed. But I don't understand, is there any inherent problem with Haskell's built-in String? Should one choose ByteString when dealing with Unicode stuff? Or, is there any resource that describes in one place all the problems Haskell has with Unicode?
There's no problem with String; you just need to remember what it means. A String is a list of Chars, and a Char is a Unicode codepoint. On the other hand, the HTTP protocol deals with *bytes*, not Unicode codepoints. In order to convert between the two, you need some kind of encoding; in the case of JSON, I believe this is always specified as UTF-8.
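The codepoint/byte distinction can be seen directly in GHCi or a small program. This is a sketch, assuming the standard bytestring and text packages (Data.ByteString, Data.Text, Data.Text.Encoding), not anything from Network.HTTP:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let s = "h\233llo" :: String        -- "héllo": five Unicode codepoints
  print (length s)                    -- counts codepoints: 5
  let bytes = TE.encodeUtf8 (T.pack s)
  print (B.length bytes)              -- counts bytes: 6, since 'é' is two bytes in UTF-8
```

The same five-character String becomes six bytes once an encoding is chosen, which is exactly the conversion step that has to happen somewhere between HTTP and your JSON parser.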
The problem for you is that the HTTP package does *not* perform UTF-8 decoding of the raw bytes sent over the network. Instead, I believe it does a naive byte-to-codepoint conversion, a.k.a. Latin-1 decoding. By downloading the data as bytes (i.e., a ByteString), you can then explicitly state that you want UTF-8 decoding instead of Latin-1.
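To illustrate the difference (a sketch assuming the bytestring and text packages, not the actual internals of Network.HTTP): given the two bytes that make up "é" in UTF-8, the naive byte-to-codepoint view produces two mangled characters, while explicit UTF-8 decoding recovers the single intended one.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C8   -- byte-to-Char conversion, i.e. Latin-1
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- 0xC3 0xA9 is the UTF-8 encoding of the single codepoint 'é' (U+00E9).
  let bytes = B.pack [0xC3, 0xA9]
  putStrLn (C8.unpack bytes)                 -- Latin-1 view: two chars, "Ã©"
  putStrLn (T.unpack (TE.decodeUtf8 bytes))  -- UTF-8 view: one char, "é"
```

The Latin-1 view is what shows up as the familiar "Ã©"-style mojibake when UTF-8 bytes are treated as codepoints one byte at a time.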
It would be entirely possible to write an HTTP library that does this automatically, but it would be inherently limited to a single encoding type. By dealing directly with bytestrings, you can work with any character encoding, as well as with binary data such as images, which has no character encoding at all.
OK, I think I understand now. I was under the assumption that the Network.HTTP package would take a look at the Content-Type header and do a behind-the-scenes conversion before decoding those bytes. Thanks for your help. -- Ionuț G. Stan | http://igstan.ro