
On 02/07/2009 23:04, Judah Jacobson wrote:
1) It would be good to have an hGetEncoding function, so that we can temporarily set the encoding of a Handle like stdin without affecting the rest of the program.
I have added this, but it might not behave exactly as you want.

  hGetEncoding :: Handle -> IO (Maybe TextEncoding)

The issue is the saving and restoring of the codec state. A TextEncoding is a factory that makes new codec instances; it has no state. However, the codec in use on a Handle does have state. So if you save and restore the codec, you lose the state: e.g. in UTF-16, you'll get a new BOM in the output.

You might or might not want to save and restore the state; I can imagine both possibilities being useful. For now, however, I propose we provide the non-state-saving version, clearly documented as such. Incidentally, providing a state-saving version would need a new type to represent the codec + state.
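To make the non-state-saving behaviour concrete, here is a minimal sketch of how a caller might temporarily switch the encoding of a Handle. The helper name withEncoding is hypothetical (it is not part of the proposed API); it uses only hGetEncoding, hSetEncoding and hSetBinaryMode from System.IO, and it assumes that a result of Nothing means the Handle was in binary mode.

```haskell
import Control.Exception (bracket)
import System.IO

-- | Hypothetical helper, not part of the proposed API: run an action
-- with a temporary encoding on a Handle, then put the old one back.
--
-- Only the TextEncoding (the codec factory) is saved. Restoring it
-- creates a fresh codec, so any codec state is lost (for example,
-- UTF-16 output may get a new BOM after the restore).
withEncoding :: Handle -> TextEncoding -> IO a -> IO a
withEncoding h enc action =
  bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> action)
  where
    -- The Handle had a text encoding: reinstate it.
    restore (Just old) = hSetEncoding h old
    -- Assumption: Nothing means the Handle was in binary mode.
    restore Nothing    = hSetBinaryMode h True
```

A call such as withEncoding stdin utf8 getLine would read one line as UTF-8 and then restore whatever encoding stdin had before, even if the action throws.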
2) It looks like your API always throws an error on invalid input; it would be great if there were some way to customize this behavior. Nothing complicated, maybe just an enum which specifies one of the following behaviors:
- throw an error
- ignore (i.e., drop) invalid bytes/Chars
- replace undecodable bytes with U+FFFD and unencodable Chars with '?'
My preference for the API change would be to add a function in GHC.IO.Encoding.Iconv; for example,
mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
since this is similar to how GHC.IO.Encoding.Latin1 allows error handling by providing latin1 and latin1_checked as separate encoders.
Any more complicated behavior is probably best handled by something like the text package.
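As an illustration of the shape of this proposal, the sketch below defines the enum and a mkTextEncodingError with the suggested signature. Everything here is hypothetical: the constructor names and the encodingName helper are invented for the example, and the implementation simply appends GNU iconv-style suffixes ("//IGNORE", "//TRANSLIT") to the name given to mkTextEncoding. Whether those suffixes are honoured depends on the platform and iconv implementation, and "//TRANSLIT" only approximates the "replace" behaviour described above.

```haskell
import System.IO (TextEncoding, mkTextEncoding)

-- | Hypothetical enum for the three behaviours listed in the proposal.
data ErrorHandling
  = ThrowError      -- throw an error on invalid input
  | IgnoreInvalid   -- drop invalid bytes/Chars
  | ReplaceInvalid  -- substitute something for invalid bytes/Chars
  deriving (Eq, Show)

-- | Build the name handed to mkTextEncoding.
-- Assumption: the underlying converter understands these suffixes.
encodingName :: String -> ErrorHandling -> String
encodingName name ThrowError     = name
encodingName name IgnoreInvalid  = name ++ "//IGNORE"
encodingName name ReplaceInvalid = name ++ "//TRANSLIT"

-- | Sketch of the proposed function; not an existing GHC API.
mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
mkTextEncodingError name handling =
  mkTextEncoding (encodingName name handling)
```

For example, mkTextEncodingError "UTF-8" IgnoreInvalid asks for the encoding named "UTF-8//IGNORE"; if the converter does not recognise the suffix, mkTextEncoding is expected to fail.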
Note that if you're using GNU iconv, you can say mkTextEncoding "UTF-8//IGNORE" to get the version that silently drops illegal characters (there's also "//TRANSLIT", which tries to find an alternative for an illegal character). This is not portable, so I can't provide it as a general facility in GHC.IO.Encoding.Iconv.

Cheers,
Simon