
On 02/07/2009 23:04, Judah Jacobson wrote:
> On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow wrote:
>> Ticket: http://hackage.haskell.org/trac/ghc/ticket/3337
>>
>> For the proposed new additions, see:
>>
>>  * System.IO (Unicode encoding/decoding):
>>    http://www.haskell.org/~simonmar/base/System-IO.html#23
>>  * System.IO (Newline conversion):
>>    http://www.haskell.org/~simonmar/base/System-IO.html#25
>>
>> Discussion period: 2 weeks (14 July).
> Three points:
>
> 1) It would be good to have an hGetEncoding function, so that we can
> temporarily set the encoding of a Handle like stdin without affecting
> the rest of the program.
Sure. This might expose the fact that there's no instance Eq TextEncoding, though - I can imagine someone wanting to know whether localeEncoding is UTF-8 or not. Perhaps there should also be

  textEncodingName :: TextEncoding -> String

the idea being that if you pass the String back to mkTextEncoding you get the same encoding. But what about normalisation issues, e.g. "UTF-8" vs. "UTF8"?
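As a rough sketch of the intended use - hypothetical code only, assuming the proposed hGetEncoding has the signature Handle -> IO (Maybe TextEncoding), with Nothing meaning the Handle is in binary mode:

  import System.IO
  import Control.Exception (bracket)

  -- Hypothetical sketch: run an action with a Handle temporarily
  -- switched to a different encoding, restoring the old state after.
  withEncoding :: Handle -> TextEncoding -> IO a -> IO a
  withEncoding h enc act =
      bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> act)
    where
      restore (Just old) = hSetEncoding h old
      restore Nothing    = hSetBinaryMode h True  -- Handle was in binary mode

  -- e.g.  withEncoding stdin utf8 getLine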
> 2) It looks like your API always throws an error on invalid input; it
> would be great if there were some way to customize this behavior.
> Nothing complicated, maybe just an enum which specifies one of the
> following behaviors:
>
>  - throw an error
>  - ignore (i.e., drop) invalid bytes/Chars
>  - replace undecodable bytes with U+FFFD and unencodable Chars with '?'
Yes.
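Something like the following, purely as an illustration (none of these names are settled):

  -- Hypothetical enum for the three behaviours listed above.
  data ErrorHandling
    = ThrowError     -- fail on invalid input (the current behaviour)
    | IgnoreErrors   -- drop undecodable bytes / unencodable Chars
    | ReplaceErrors  -- substitute U+FFFD when decoding, '?' when encoding
    deriving (Eq, Show)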
> My preference for the API change would be to add a function in
> GHC.IO.Encoding.Iconv; for example,
>
>   mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
So you're suggesting that we implement this only for iconv? That would be easy enough, but then it wouldn't be available on Windows. Another way would be to implement it at the Handle level, by catching encoding/decoding errors from the codec and applying the appropriate workaround. This is a lot more work, of course.
> since this is similar to how GHC.IO.Encoding.Latin1 allows error
> handling by providing latin1 and latin1_checked as separate encoders.
>
> Any more complicated behavior is probably best handled by something
> like the text package.
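For what it's worth, the difference between those two encoders can be modelled at the list level like this - a toy sketch, not GHC's actual buffer-based implementation:

  import Data.Word (Word8)
  import Data.Char (ord)

  -- Toy model of 'latin1': trusts its input, silently truncating any
  -- Char above U+00FF to its low byte.
  encodeLatin1 :: String -> [Word8]
  encodeLatin1 = map (fromIntegral . ord)

  -- Toy model of 'latin1_checked': reports the first unrepresentable
  -- Char instead of silently corrupting it.
  encodeLatin1Checked :: String -> Either Char [Word8]
  encodeLatin1Checked = traverse check
    where
      check c | ord c <= 0xFF = Right (fromIntegral (ord c))
              | otherwise     = Left c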
> 3) How hard would it be to get Windows code page support working? I'd
> like that a lot since it would further simplify the code in Haskeline.
> I can help out with the implementation if it's just a question of time.
Ok, so I did look into this. The problem is that the MultiByteToWideChar API just isn't good enough:

1. It only converts to UTF-16. So I can handle this by using UTF-16 as our internal representation instead of UTF-32, and indeed I have made all the changes for this - there is a #define in the library. I found it slower than UTF-32, however.

2. If there's a decoding error, you don't get to find out where in the input the error occurred, or do a partial conversion.

3. If there isn't enough room in the target buffer, you don't get to do a partial conversion.

4. Detecting errors is apparently only supported on Win XP and later (MB_ERR_INVALID_CHARS), and for some code pages it isn't supported at all.

2 and 3 are the real show-stoppers. Duncan Coutts found this code from AT&T UWIN that implements iconv in terms of MultiByteToWideChar:

  http://www.google.com/codesearch/p?hl=en&sa=N&cd=2&ct=rc#0IKL7zWk-JU/src/lib/libast/comp/iconv.c&l=333

Notice how it uses a binary search strategy to solve the 3rd problem above. Yuck! This would be the common case if we used this code in the IO library. This is why I wrote our own UTF-{8,16,32} codecs for GHC (borrowing some code from the text package).

BTW, Python uses its own automatically-generated codecs for Windows codepages. Maybe we should do that too.

Cheers,
Simon
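P.S. To make the binary-search workaround concrete, here is a rough Haskell sketch. All names are hypothetical: 'tryConvert n' stands in for converting the first n input units into a fixed-size output buffer (as MultiByteToWideChar would), returning Nothing when the buffer is too small.

  import Data.Maybe (isJust)

  -- Largest n in [lo..hi] for which 'fits' holds, assuming 'fits' is
  -- monotone (True up to some point, False beyond) and 'fits lo' holds.
  bsearch :: (Int -> Bool) -> Int -> Int -> Int
  bsearch fits lo hi
    | lo == hi  = lo
    | otherwise =
        let mid = (lo + hi + 1) `div` 2    -- round up so the range shrinks
        in if fits mid then bsearch fits mid hi
                       else bsearch fits lo (mid - 1)

  -- Convert the largest prefix of the input that fits the output
  -- buffer; converting 0 units always trivially fits.
  largestFit :: (Int -> Maybe a) -> Int -> (Int, Maybe a)
  largestFit tryConvert total =
    let n = bsearch (isJust . tryConvert) 0 total
    in (n, tryConvert n)

Each failed probe costs a full conversion attempt, which is why this is so unpleasant when overflowing the target buffer is the common case.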