
On 02/07/2009 23:04, Judah Jacobson wrote:
> On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow wrote:
>> Ticket: http://hackage.haskell.org/trac/ghc/ticket/3337
>>
>> For the proposed new additions, see:
>>
>>  * System.IO (Unicode encoding/decoding):
>>    http://www.haskell.org/~simonmar/base/System-IO.html#23
>>  * System.IO (Newline conversion):
>>    http://www.haskell.org/~simonmar/base/System-IO.html#25
>>
>> Discussion period: 2 weeks (14 July).
> Three points:
>
> 1) It would be good to have an hGetEncoding function, so that we can
> temporarily set the encoding of a Handle like stdin without affecting
> the rest of the program.
Sure. This might expose the fact that there's no instance Eq TextEncoding, though - I can imagine someone wanting to know whether localeEncoding is UTF-8 or not. Perhaps there should also be

  textEncodingName :: TextEncoding -> String

the idea being that if you pass the String back to mkTextEncoding you get the same encoding. But what about normalisation issues, e.g. "UTF-8" vs. "UTF8"?
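As a rough sketch of the intended use - hypothetical code only, assuming the proposed hGetEncoding has the signature Handle -> IO (Maybe TextEncoding), with Nothing meaning the Handle is in binary mode:

  import System.IO
  import Control.Exception (bracket)

  -- Hypothetical sketch: run an action with a Handle temporarily
  -- switched to a different encoding, restoring the old state after.
  withEncoding :: Handle -> TextEncoding -> IO a -> IO a
  withEncoding h enc act =
      bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> act)
    where
      restore (Just old) = hSetEncoding h old
      restore Nothing    = hSetBinaryMode h True  -- Handle was in binary mode

  -- e.g.  withEncoding stdin utf8 getLine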
> 2) It looks like your API always throws an error on invalid input; it
> would be great if there were some way to customize this behavior.
> Nothing complicated, maybe just an enum which specifies one of the
> following behaviors:
>
>  - throw an error
>  - ignore (i.e., drop) invalid bytes/Chars
>  - replace undecodable bytes with U+FFFD and unencodable Chars with '?'
Yes.
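Something like the following, purely as an illustration (none of these names are settled):

  -- Hypothetical enum for the three behaviours listed above.
  data ErrorHandling
    = ThrowError     -- fail on invalid input (the current behaviour)
    | IgnoreErrors   -- drop undecodable bytes / unencodable Chars
    | ReplaceErrors  -- substitute U+FFFD when decoding, '?' when encoding
    deriving (Eq, Show)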
> My preference for the API change would be to add a function in
> GHC.IO.Encoding.Iconv; for example,
>
>   mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
So you're suggesting that we implement this only for iconv? That would be easy enough, but then it wouldn't be available on Windows. Another way would be to implement it at the Handle level, by catching encoding/decoding errors from the codec and applying the appropriate workaround. This is a lot more work, of course.
> since this is similar to how GHC.IO.Encoding.Latin1 allows error
> handling by providing latin1 and latin1_checked as separate encoders.
>
> Any more complicated behavior is probably best handled by something
> like the text package.
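For what it's worth, the difference between those two encoders can be modelled at the list level like this - a toy sketch, not GHC's actual buffer-based implementation:

  import Data.Word (Word8)
  import Data.Char (ord)

  -- Toy model of 'latin1': trusts its input, silently truncating any
  -- Char above U+00FF to its low byte.
  encodeLatin1 :: String -> [Word8]
  encodeLatin1 = map (fromIntegral . ord)

  -- Toy model of 'latin1_checked': reports the first unrepresentable
  -- Char instead of silently corrupting it.
  encodeLatin1Checked :: String -> Either Char [Word8]
  encodeLatin1Checked = traverse check
    where
      check c | ord c <= 0xFF = Right (fromIntegral (ord c))
              | otherwise     = Left c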
> 3) How hard would it be to get Windows code page support working? I'd
> like that a lot since it would further simplify the code in Haskeline.
> I can help out with the implementation if it's just a question of time.
Ok, so I did look into this. The problem is that the MultiByteToWideChar API just isn't good enough:

1. It only converts to UTF-16. So I can handle this by using UTF-16 as our internal representation instead of UTF-32, and indeed I have made all the changes for this - there is a #define in the library. I found it slower than UTF-32, however.

2. If there's a decoding error, you don't get to find out where in the input the error occurred, or do a partial conversion.

3. If there isn't enough room in the target buffer, you don't get to do a partial conversion.

4. Detecting errors is apparently only supported on Win XP and later (MB_ERR_INVALID_CHARS), and for some code pages it isn't supported at all.

2 and 3 are the real show-stoppers. Duncan Coutts found this code from AT&T UWIN that implements iconv in terms of MultiByteToWideChar:

  http://www.google.com/codesearch/p?hl=en&sa=N&cd=2&ct=rc#0IKL7zWk-JU/src/lib/libast/comp/iconv.c&l=333

Notice how it uses a binary search strategy to solve the 3rd problem above. Yuck! This would be the common case if we used this code in the IO library. This is why I wrote our own UTF-{8,16,32} codecs for GHC (borrowing some code from the text package).

BTW, Python uses its own automatically-generated codecs for Windows codepages. Maybe we should do that too.

Cheers,
Simon
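P.S. To make the binary-search workaround concrete, here is a rough Haskell sketch. All names are hypothetical: 'tryConvert n' stands in for converting the first n input units into a fixed-size output buffer (as MultiByteToWideChar would), returning Nothing when the buffer is too small.

  import Data.Maybe (isJust)

  -- Largest n in [lo..hi] for which 'fits' holds, assuming 'fits' is
  -- monotone (True up to some point, False beyond) and 'fits lo' holds.
  bsearch :: (Int -> Bool) -> Int -> Int -> Int
  bsearch fits lo hi
    | lo == hi  = lo
    | otherwise =
        let mid = (lo + hi + 1) `div` 2    -- round up so the range shrinks
        in if fits mid then bsearch fits mid hi
                       else bsearch fits lo (mid - 1)

  -- Convert the largest prefix of the input that fits the output
  -- buffer; converting 0 units always trivially fits.
  largestFit :: (Int -> Maybe a) -> Int -> (Int, Maybe a)
  largestFit tryConvert total =
    let n = bsearch (isJust . tryConvert) 0 total
    in (n, tryConvert n)

Each failed probe costs a full conversion attempt, which is why this is so unpleasant when overflowing the target buffer is the common case.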