
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Donald Bruce Stewart
andrewcoppin:
> I don't know if anybody cares, but... Today I wrote some trivial code to
> decode (not encode) UTF-16.
>
> I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as
> it happens.
Perhaps you could polish it up, and provide it in a form suitable for use as a patch to:
http://code.haskell.org/utf8-string/
that is, put it in a module:
Codec.Binary.UTF16.String
and provide the functions:
    encode :: String -> [Word8]
    decode :: [Word8] -> String
? And then submit that as a patch to Eric, the utf8 maintainer.
-- Don
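For what it's worth, here's a rough sketch of what such a module could look like. The choices below are mine, not anything Don or Andrew specified: it assumes big-endian byte order, does no BOM detection, and just substitutes U+FFFD for unpaired surrogates.

module Codec.Binary.UTF16.String (encode, decode) where

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word16, Word8)

-- Pair up bytes into 16-bit code units (big-endian; a trailing odd
-- byte is silently dropped).
toUnits :: [Word8] -> [Word16]
toUnits (hi:lo:rest) =
  (fromIntegral hi `shiftL` 8 .|. fromIntegral lo) : toUnits rest
toUnits _ = []

-- Combine surrogate pairs into code points above U+FFFF; unpaired
-- surrogates become U+FFFD.
fromUnits :: [Word16] -> String
fromUnits [] = []
fromUnits (u:us)
  | u >= 0xD800 && u <= 0xDBFF =
      case us of
        (l:us') | l >= 0xDC00 && l <= 0xDFFF ->
          chr (0x10000
               + (fromIntegral (u - 0xD800) `shiftL` 10)
               + fromIntegral (l - 0xDC00)) : fromUnits us'
        _ -> '\xFFFD' : fromUnits us
  | u >= 0xDC00 && u <= 0xDFFF = '\xFFFD' : fromUnits us
  | otherwise = chr (fromIntegral u) : fromUnits us

decode :: [Word8] -> String
decode = fromUnits . toUnits

encode :: String -> [Word8]
encode = concatMap (unitsToBytes . charToUnits)
  where
    -- Split a Char into one code unit, or a surrogate pair above U+FFFF.
    charToUnits :: Char -> [Word16]
    charToUnits c
      | ord c < 0x10000 = [fromIntegral (ord c)]
      | otherwise =
          let n = ord c - 0x10000
          in [ 0xD800 + fromIntegral (n `shiftR` 10)
             , 0xDC00 + fromIntegral (n .&. 0x3FF) ]
    -- Serialise code units as big-endian byte pairs.
    unitsToBytes :: [Word16] -> [Word8]
    unitsToBytes = concatMap (\u -> [ fromIntegral (u `shiftR` 8)
                                    , fromIntegral (u .&. 0xFF) ])

For example, decode [0x00,0x48,0xD8,0x01,0xDC,0x37] gives "H\x10437", and encode is its inverse on well-formed input.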
There is a UTF16 en/decoder in Foreign.C.String (see cWcharsToChars & charsToCWchars):
http://darcs.haskell.org/libraries/base/Foreign/C/String.hs
but it only seems to be available for Windows users, via the CWString functions.

In Takusen we also have a UTF8 module (it's about the fourth or fifth out there, after HXML's, John Meacham's, someone else's - Graham Klyne's? - and one hidden away in GHC's internals). It has pure en/decode functions (String <-> [Word8]), naturally (which we ripped off from John Meacham), but we were more interested in efficient marshalling from CStrings (or data buffers, if you like), so we wrote specific code to marshal CString -> String fairly quickly and space-efficiently (see fromUTF8Ptr, which is wrapped by peekUTF8String{Len}):
http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs

We stuck it in the Foreign.C namespace, rather than Codec, because we're doing more FFI-related stuff. I'm not sure what the best location is; perhaps there should be a split, with FFI functions (withUTF8String, peekUTF8String) in Foreign.C, and pure functions in Codec.

(Also, is there a wiki page somewhere which gives advice on how to locate/name library modules, and what the currently occupied namespace is, including user libs like those on Hackage? It's sometimes a bit tricky to figure out where to put a new module.)

Obviously a proliferation of UTF8 modules isn't great for code re-use. Is there a plan to consolidate and expose UTF8 and UTF16 de- and encoders in the libraries? I note that the various UTF8 modules have fairly similar implementations, and differ mainly w.r.t. how much of the UTF8 codepoint space they handle (for example, HXML's decodes up to 6 bytes, which isn't strictly standards-compliant). Also, some choice as to how to handle errors in the byte stream would be nice, i.e. the user could choose between functions which raise errors and functions which introduce substitution chars (one possible shape for that is sketched below, after my sig).

Alistair
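P.S. To make that last suggestion concrete, here's one possible shape for the API. The names are invented, and only the single-byte path is written out (real multi-byte decoding elided); the point is just that the decoder is parameterised by a handler that decides what to do with malformed input.

import Data.Char (chr)
import Data.Word (Word8)

-- Given an error message and the remaining input, produce a substitute
-- character and the input to resume decoding from.
type ErrorHandler = String -> [Word8] -> (Char, [Word8])

strictHandler :: ErrorHandler
strictHandler msg _ = error ("UTF-8 decode error: " ++ msg)

lenientHandler :: ErrorHandler
lenientHandler _ (_:rest) = ('\xFFFD', rest)  -- drop the offending byte
lenientHandler _ []       = ('\xFFFD', [])

decodeWith :: ErrorHandler -> [Word8] -> String
decodeWith _     []     = []
decodeWith onErr (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decodeWith onErr bs
  | otherwise =                     -- multi-byte sequences would go here
      let (c, rest) = onErr "malformed sequence" (b:bs)
      in c : decodeWith onErr rest

Then decodeWith strictHandler blows up on bad input, while decodeWith lenientHandler inserts U+FFFD and carries on; fixed wrappers (decode, decodeLenient, or whatever) could be exported on top of that.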