portable encoding/decoding without going via a handle

Hi,

I need to convert directly between different string encodings, rather than just using a particular encoding when reading from/writing to a Handle.

I'm aware of the following options, but they have a few problems:

- text-icu: not easily usable on Windows as it requires libicu
- text: just handles utf8/16/32
- iconv: POSIX only

It seems like GHC's TextEncoding has the necessary low-level functionality (http://hackage.haskell.org/packages/archive/base/latest/doc/html/GHC-IO-Enco...), but I can't find any high-level interface for directly transcoding between String/Bytestring/Text.

Am I missing something, or would this be a useful addition as a separate library?

Cheers,
Ganesh

Ganesh Sittampalam
I need to convert directly between different string encodings, rather than just using a particular encoding when reading from/writing to a Handle.
I'm aware of the following options, but they have a few problems:
- text-icu: not easily usable on Windows as it requires libicu
- text: just handles utf8/16/32
- iconv: POSIX only
It seems like GHC's TextEncoding has the necessary low-level functionality (http://hackage.haskell.org/packages/archive/base/latest/doc/html/GHC-IO-Enco...), but I can't find any high-level interface for directly transcoding between String/Bytestring/Text.
Am I missing something, or would this be a useful addition as a separate library?
btw, looking at the GHC.IO.Encoding.* modules, it seems to me that 'mkTextEncoding'[1] only supports utf8/16/32 in a system independent fashion:

,----
| The set of known encodings is system-dependent, but includes at least:
|
|  - UTF-8
|  - UTF-16, UTF-16BE, UTF-16LE
|  - UTF-32, UTF-32BE, UTF-32LE
|
| On systems using GNU iconv (e.g. Linux), there is additional notation
| for specifying how illegal characters are handled:
|
|  - a suffix of //IGNORE, e.g. UTF-8//IGNORE, will cause all illegal
|    sequences on input to be ignored, and on output will drop all code
|    points that have no representation in the target encoding.
|
|  - a suffix of //TRANSLIT will choose a replacement character for
|    illegal sequences or code points.
|
| On Windows, you can access supported code pages with the prefix CP; for
| example, "CP1250".
`----

...so does using GHC.Encoding.* actually provide you with more encodings than using the other options ('text' et al.) you mentioned? which text encodings beyond the UTF-family do you need btw?

[1]: http://hackage.haskell.org/packages/archive/base/4.6.0.0/doc/html/GHC-IO-Enc...

cheers,
hvr
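Since the set of names accepted by 'mkTextEncoding' is system-dependent, one way to find out what a given machine offers is simply to ask at run time. The following is only a minimal sketch: the helper name encodingAvailable is hypothetical, and it assumes that an unrecognised name is reported as an IOException when the encoding is constructed, which may not hold for every version of base.

```haskell
import Control.Exception (IOException, try)
import System.IO (TextEncoding, mkTextEncoding)

-- | Hypothetical helper: ask the running system whether it accepts an
-- encoding name such as "UTF-16LE" or "CP1250".
-- Assumption: an unknown name surfaces as an IOException from
-- mkTextEncoding; treat the result as a probe, not a guarantee.
encodingAvailable :: String -> IO Bool
encodingAvailable name = do
  r <- try (mkTextEncoding name) :: IO (Either IOException TextEncoding)
  return (either (const False) (const True) r)
```

For the names the documentation promises everywhere (the UTF family), this should return True on any platform; for anything else the answer depends on iconv or on the Windows code pages installed.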

On 25/11/2012 10:51, Herbert Valerio Riedel wrote:
btw, looking at the GHC.IO.Encoding.* modules, it seems to me that 'mkTextEncoding'[1] only supports utf8/16/32 in a system independent fashion:
...so does using GHC.Encoding.* actually provide you with more encodings than using the other options ('text' et al.) you mentioned? which text encodings beyond the UTF-family do you need btw?
I actually only need ones that exist on the current platform because they're currently in use as GHC's encodings when reading from the filesystem/console.

In theory I think the need to do the transcoding could be avoided by just setting those encodings to the right values in the first place, but in practice it's hard to do that as a purely local change.

Cheers,
Ganesh
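For the case described here (converting with whatever TextEncoding values the platform already provides, without a Handle), one possible route that was not proposed in the thread is to go through GHC.Foreign in base, which marshals a String to and from a byte buffer using an explicit TextEncoding. The sketch below is illustrative only: the names encodeWith, decodeWith, transcode and latin1ToUtf8 are hypothetical, it depends on the bytestring package, and it inherits whatever error behaviour the chosen TextEncoding has for invalid input.

```haskell
import qualified Data.ByteString as B
import qualified GHC.Foreign as GHC
import System.IO (TextEncoding, latin1, utf8)

-- | Encode a String to bytes with the given encoding (hypothetical helper).
-- GHC.Foreign.withCStringLen fills a temporary buffer; packCStringLen copies it.
encodeWith :: TextEncoding -> String -> IO B.ByteString
encodeWith enc s = GHC.withCStringLen enc s B.packCStringLen

-- | Decode bytes to a String with the given encoding (hypothetical helper).
decodeWith :: TextEncoding -> B.ByteString -> IO String
decodeWith enc bs = B.useAsCStringLen bs (GHC.peekCStringLen enc)

-- | Transcode bytes from one encoding to another by way of String.
transcode :: TextEncoding -> TextEncoding -> B.ByteString -> IO B.ByteString
transcode from to bs = decodeWith from bs >>= encodeWith to

-- | Example instance: Latin-1 bytes to UTF-8 bytes.
latin1ToUtf8 :: B.ByteString -> IO B.ByteString
latin1ToUtf8 = transcode latin1 utf8
```

The same functions should work with an encoding obtained from mkTextEncoding or with the ones GHC is currently using for the filesystem or console, though this goes through String and so is not tuned for large inputs.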
participants (2)
- Ganesh Sittampalam
- Herbert Valerio Riedel