Proposal #3337: expose Unicode and newline translation from System.IO

Ticket: http://hackage.haskell.org/trac/ghc/ticket/3337

For the proposed new additions, see:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
* http://www.haskell.org/~simonmar/base/System-IO.html#25 System.IO (Newline conversion)

Discussion period: 2 weeks (14 July).

This patch increases the need to make binary Handles a different type.

When you set the TextEncoding of an output Handle to utf16 or utf32, does that trigger output of a BOM? When you set the TextEncoding of an input Handle to utf16 or utf32, does it immediately read a character looking for a BOM?

Do you really need two Newline modes per Handle? Most Handles are unidirectional, and even a ReadWrite Handle is reading and writing to the same file. The main benefit seems to be that you can apply universalNewlineMode to any Handle, but is that worth the complication?

On 30/06/2009 14:10, Ross Paterson wrote:
This patch increases the need to make binary Handles a different type.
It does indeed. That's something for another proposal, though.
When you set the TextEncoding of an output Handle to utf16 or utf32, does that trigger output of a BOM?
No, but the BOM will be output as part of the first writing operation.
When you set the TextEncoding of an input Handle to utf16 or utf32, does it immediately read a character looking for a BOM?
No, but it looks for a BOM when the first batch of bytes is decoded, which will happen the first time you read from the Handle.
Do you really need two Newline modes per Handle? Most Handles are unidirectional, and even a ReadWrite Handle is reading and writing to the same file. The main benefit seems to be that you can apply universalNewlineMode to any Handle, but is that worth the complication?
There are also bidirectional Sockets. But I take your point; universalNewlineMode is indeed the main reason we have separate input/output modes. I wouldn't have any objection to simplifying it, if people don't think the extra complication is worthwhile.

Cheers,
Simon
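The BOM-on-first-write behaviour Simon describes can be observed directly. A minimal sketch against the proposed System.IO API (the file name is arbitrary): the raw bytes of the file start with the UTF-16 BOM, even though hSetEncoding itself wrote nothing.

```haskell
import System.IO
import Data.Char (ord)

-- Write "hi" through a utf16 Handle, then inspect the raw bytes.
firstTwoBytes :: IO [Int]
firstTwoBytes = do
  h <- openFile "bom-test.bin" WriteMode
  hSetEncoding h utf16   -- no bytes are written at this point
  hPutStr h "hi"         -- the BOM goes out as part of this first write
  hClose h
  h' <- openBinaryFile "bom-test.bin" ReadMode
  b0 <- hGetChar h'
  b1 <- hGetChar h'
  hClose h'
  return (map ord [b0, b1])

main :: IO ()
main = firstTwoBytes >>= print  -- FF FE (little-endian) or FE FF (big-endian)
```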

On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow wrote:
Ticket:
http://hackage.haskell.org/trac/ghc/ticket/3337
For the proposed new additions, see:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
* http://www.haskell.org/~simonmar/base/System-IO.html#25 System.IO (Newline conversion)
Discussion period: 2 weeks (14 July).
Three points:

1) It would be good to have an hGetEncoding function, so that we can temporarily set the encoding of a Handle like stdin without affecting the rest of the program.

2) It looks like your API always throws an error on invalid input; it would be great if there were some way to customize this behavior. Nothing complicated, maybe just an enum which specifies one of the following behaviors:

- throw an error
- ignore (i.e., drop) invalid bytes/Chars
- replace undecodable bytes with U+FFFD and unencodable Chars with '?'

My preference for the API change would be to add a function in GHC.IO.Encoding.Iconv; for example,

mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding

since this is similar to how GHC.IO.Encoding.Latin1 allows error handling by providing latin1 and latin1_checked as separate encoders. Any more complicated behavior is probably best handled by something like the text package.

3) How hard would it be to get Windows code page support working? I'd like that a lot since it would further simplify the code in Haskeline. I can help out with the implementation if it's just a question of time.

Thanks again for taking care of all this,
-Judah

On 02/07/2009 23:04, Judah Jacobson wrote:
On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow wrote:
Ticket:
http://hackage.haskell.org/trac/ghc/ticket/3337
For the proposed new additions, see:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
* http://www.haskell.org/~simonmar/base/System-IO.html#25 System.IO (Newline conversion)
Discussion period: 2 weeks (14 July).
Three points:
1) It would be good to have an hGetEncoding function, so that we can temporarily set the encoding of a Handle like stdin without affecting the rest of the program.
Sure. This might expose the fact that there's no instance Eq TextEncoding, though - I can imagine someone wanting to know whether localeEncoding is UTF-8 or not. Perhaps there should also be

textEncodingName :: TextEncoding -> String

the idea being that if you pass the String back to mkTextEncoding you get the same encoding. But what about normalisation issues, e.g. "UTF-8" vs. "UTF8"?
2) It looks like your API always throws an error on invalid input; it would be great if there were some way to customize this behavior. Nothing complicated, maybe just an enum which specifies one of the following behaviors:
- throw an error
- ignore (i.e., drop) invalid bytes/Chars
- replace undecodable bytes with U+FFFD and unencodable Chars with '?'
Yes.
My preference for the API change would be to add a function in GHC.IO.Encoding.Iconv; for example,
mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
So you're suggesting that we implement this only for iconv? That would be easy enough, but then it wouldn't be available on Windows. Another way would be to implement it at the Handle level, by catching encoding/decoding errors from the codec and applying the appropriate workaround. This is a lot more work, of course.
since this is similar to how GHC.IO.Encoding.Latin1 allows error handling by providing latin1 and latin1_checked as separate encoders.
Any more complicated behavior is probably best handled by something like the text package.
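For illustration, the enum Judah sketches might look like the following toy model; ErrorHandling, decodeWith, and ascii1 are invented names, not part of the proposal's actual API.

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Hypothetical policy type (names invented for illustration).
data ErrorHandling
  = ThrowError      -- raise an exception on an invalid sequence
  | IgnoreInvalid   -- silently drop invalid bytes/Chars
  | ReplaceInvalid  -- undecodable bytes become U+FFFD
  deriving (Eq, Show)

-- A single-byte decoder parameterised by policy, sketched over lists
-- (a real codec works on buffers, not lists).
decodeWith :: ErrorHandling -> (Word8 -> Maybe Char) -> [Word8] -> String
decodeWith policy decode1 = concatMap step
  where
    step b = case decode1 b of
      Just c  -> [c]
      Nothing -> case policy of
        ThrowError     -> error ("invalid byte: " ++ show b)
        IgnoreInvalid  -> []
        ReplaceInvalid -> ['\xFFFD']

-- Example underlying decoder: strict ASCII.
ascii1 :: Word8 -> Maybe Char
ascii1 b | b < 0x80  = Just (chr (fromIntegral b))
         | otherwise = Nothing
```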
3) How hard would it be to get Windows code page support working? I'd like that a lot since it would further simplify the code in Haskeline. I can help out with the implementation if it's just a question of time.
Ok, so I did look into this. The problem is that the MultiByteToWideChar API just isn't good enough.

1. It only converts to UTF-16. So I can handle this by using UTF-16 as our internal representation instead of UTF-32, and indeed I have made all the changes for this - there is a #define in the library. I found it slower than UTF-32, however.

2. If there's a decoding error, you don't get to find out where in the input the error occurred, or do a partial conversion.

3. If there isn't enough room in the target buffer, you don't get to do a partial conversion.

4. Detecting errors is apparently only supported on Win XP and later (MB_ERR_INVALID_CHARS), and for some code pages it isn't supported at all.

2 and 3 are the real show-stoppers. Duncan Coutts found this code from AT&T UWIN that implements iconv in terms of MultiByteToWideChar:

http://www.google.com/codesearch/p?hl=en&sa=N&cd=2&ct=rc#0IKL7zWk-JU/src/lib/libast/comp/iconv.c&l=333

Notice how it uses a binary search strategy to solve the 3rd problem above. Yuck! This would be the common case if we used this code in the IO library. This is why I wrote our own UTF-{8,16,32} codecs for GHC (borrowing some code from the text package).

BTW, Python uses its own automatically-generated codecs for Windows codepages. Maybe we should do that too.

Cheers,
Simon
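For reference, a generated code-page codec of the kind Python uses boils down to a lookup table per code page. A toy sketch with a handful of CP1252 entries; the full 256-entry table, and CP1252's undefined bytes, are glossed over here.

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- A sketch of a table-driven code-page decoder.  Only a few CP1252
-- entries where the code page differs from Latin-1 are shown; a
-- generated codec would carry the full table.
cp1252 :: Word8 -> Char
cp1252 0x80 = '\x20AC'  -- EURO SIGN
cp1252 0x85 = '\x2026'  -- HORIZONTAL ELLIPSIS
cp1252 0x91 = '\x2018'  -- LEFT SINGLE QUOTATION MARK
cp1252 0x92 = '\x2019'  -- RIGHT SINGLE QUOTATION MARK
cp1252 b    = chr (fromIntegral b)  -- elsewhere CP1252 mostly follows Latin-1

decodeCP1252 :: [Word8] -> String
decodeCP1252 = map cp1252
```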

On Fri, Jul 3, 2009 at 1:23 AM, Simon Marlow wrote:
On 02/07/2009 23:04, Judah Jacobson wrote:
On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow wrote:
Ticket:
http://hackage.haskell.org/trac/ghc/ticket/3337
For the proposed new additions, see:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
* http://www.haskell.org/~simonmar/base/System-IO.html#25 System.IO (Newline conversion)
Discussion period: 2 weeks (14 July).
3) How hard would it be to get Windows code page support working? I'd like that a lot since it would further simplify the code in Haskeline. I can help out with the implementation if it's just a question of time.
Ok, so I did look into this. The problem is that the MultiByteToWideChar API just isn't good enough. [...] BTW, Python uses its own automatically-generated codecs for Windows codepages. Maybe we should do that too.
That approach seems best; and it would be a nice small step towards a pure Haskell replacement for libiconv. I've started working on this. -Judah

On 02/07/2009 23:04, Judah Jacobson wrote:
1) It would be good to have an hGetEncoding function, so that we can temporarily set the encoding of a Handle like stdin without affecting the rest of the program.
I have added this, but it might not behave exactly as you want.

hGetEncoding :: Handle -> IO (Maybe TextEncoding)

The issue is saving and restoring of the codec state. A TextEncoding is a factory that makes new codec instances; it has no state. However, the codec in use on a Handle does have a state. So if you save and restore the codec, you lose the state - e.g. in UTF-16, you'll get a new BOM in the output.

You might or might not want to save and restore the state; I can imagine both possibilities being useful. For now, however, I propose we provide the non-state-saving version, clearly documented as such. Providing a state-saving version would need a new type to represent the codec + state, incidentally.
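Given the non-state-saving hGetEncoding, the temporary-encoding pattern Judah asked for can be written as a small bracket; withEncoding is a hypothetical helper, not part of the proposal, and it inherits the codec-state caveat above.

```haskell
import System.IO
import Control.Exception (bracket)

-- Run an action with a temporary encoding on a Handle, restoring the
-- previous TextEncoding afterwards.  Per the caveat above, this
-- restores the encoding but not the codec's internal state (e.g. BOM
-- emission in UTF-16).
withEncoding :: Handle -> TextEncoding -> IO a -> IO a
withEncoding h enc action =
  bracket (hGetEncoding h)
          restore
          (\_ -> hSetEncoding h enc >> action)
  where
    restore (Just old) = hSetEncoding h old
    restore Nothing    = hSetBinaryMode h True  -- the Handle was binary
```

For example, withEncoding stdin utf8 getLine reads one line as UTF-8 and then puts stdin's previous encoding back.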
2) It looks like your API always throws an error on invalid input; it would be great if there were some way to customize this behavior. Nothing complicated, maybe just an enum which specifies one of the following behaviors:
- throw an error
- ignore (i.e., drop) invalid bytes/Chars
- replace undecodable bytes with U+FFFD and unencodable Chars with '?'
My preference for the API change would be to add a function in GHC.IO.Encoding.Iconv; for example,
mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
since this is similar to how GHC.IO.Encoding.Latin1 allows error handling by providing latin1 and latin1_checked as separate encoders.
Any more complicated behavior is probably best handled by something like the text package.
Note that if you're using GNU iconv, you can say

mkTextEncoding "UTF-8//IGNORE"

to get the version that silently drops illegal characters (there's also "//TRANSLIT", which tries to find an alternative for an illegal character). This is not portable, so I can't provide it as a general facility in GHC.IO.Encoding.Iconv.

Cheers,
Simon

On Tue, 2009-06-30 at 13:03 +0100, Simon Marlow wrote:
Ticket:
http://hackage.haskell.org/trac/ghc/ticket/3337
For the proposed new additions, see:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
* http://www.haskell.org/~simonmar/base/System-IO.html#25 System.IO (Newline conversion)
Discussion period: 2 weeks (14 July).
A couple of things we brought up at the ghc irc meeting yesterday:

* UTF-8 with or without BOM? Or variants, e.g. utf8_bom. Do we need all three variants:

  - (pass through BOM, produce no BOM) -- raw utf8
  - (accept and ignore BOM, produce BOM) -- utf8 with BOM
  - (accept and ignore BOM, produce no BOM) -- permissive

  After thinking about it a bit, I think we can get away with just the existing utf8 and a utf8_bom that accepts a BOM and produces a BOM. The reason is that to get the third behaviour you just read with utf8_bom and write with utf8. Most operations on text files are a read or write of the whole file, not read/write on a single file.

* For the moment we are not publicly exposing the TextEncoding type. Later we may want to consider making TextEncoding pure (using ST) and sharing it for pure conversions String/Text <-> ByteString.

Duncan
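Duncan's third behaviour (accept and ignore a BOM on input, produce none on output) then needs no third encoding: read with utf8_bom and write with plain utf8. A sketch under those proposed names:

```haskell
import System.IO

-- Copy a text file, stripping a leading BOM if present and writing
-- none back: utf8_bom accepts and ignores a BOM on input, while plain
-- utf8 never produces one on output.
stripBOM :: FilePath -> FilePath -> IO ()
stripBOM inFile outFile =
  withFile inFile ReadMode $ \hin -> do
    hSetEncoding hin utf8_bom
    withFile outFile WriteMode $ \hout -> do
      hSetEncoding hout utf8
      hGetContents hin >>= hPutStr hout
```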

On Tue, Jun 30, 2009 at 01:03:17PM +0100, Simon Marlow wrote:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
Is it possible to make an

availableEncodings :: IO [(String, TextEncoding)]

? Also, mkTextEncoding says that it throws an isDoesNotExistError if the named encoding doesn't exist, but the code in base at least looks like it throws InvalidArgument on Windows, and nothing on other platforms. Perhaps it's different in your tree, though.

Thanks
Ian

On 05/07/2009 13:15, Ian Lynagh wrote:
On Tue, Jun 30, 2009 at 01:03:17PM +0100, Simon Marlow wrote:
* http://www.haskell.org/~simonmar/base/System-IO.html#23 System.IO (Unicode encoding/decoding)
Is it possible to make an availableEncodings :: IO [(String, TextEncoding)] ?
No way that I know of. iconv doesn't give you a way to enumerate the available encodings.
Also, mkTextEncoding says that it throws an isDoesNotExistError if the named encoding doesn't exist, but the code in base at least looks like it throws InvalidArgument on Windows, and nothing on other platforms. Perhaps it's different in your tree, though.
It throws NoSuchThing on Windows:

mkTextEncoding e = ioException
  (IOError Nothing NoSuchThing "mkTextEncoding"
           ("unknown encoding:" ++ e) Nothing Nothing)

but on Unix, you're right, there's no exception until the encoding is instantiated, which happens when a Handle is opened. I'll look into fixing this.

Cheers,
Simon
participants (5)
- Duncan Coutts
- Ian Lynagh
- Judah Jacobson
- Ross Paterson
- Simon Marlow