
I don't know if anybody cares, but... Today a wrote some trivial code to decode (not encode) UTF-16. I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as it happens. (I didn't bother decoding code points outside the BMP - I'm sure you can figure out why.) If anybody is interested, I can share the code - but it's pretty minimal...

andrewcoppin:
I don't know if anybody cares, but... Today a wrote some trivial code to decode (not encode) UTF-16.
I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as it happens. (I didn't bother decoding code points outside the BMP - I'm sure you can figure out why.)
If anybody is interested, I can share the code - but it's pretty minimal...
Perhaps you could polish it up, and provide it in a form suitable for use as a patch to: http://code.haskell.org/utf8-string/ that is, put it in a module: Codec.Binary.UTF16.String and provide the functions: encode :: String -> [Word8] decode :: [Word8] -> String ? And then submit that as a patch to Eric, the utf8 maintainer. -- Don

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Donald Bruce Stewart
I don't know if anybody cares, but... Today a wrote some
andrewcoppin: trivial code to
decode (not encode) UTF-16.
I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as it happens.
Perhaps you could polish it up, and provide it in a form suitable for use as a patch to:
http://code.haskell.org/utf8-string/
that is, put it in a module:
Codec.Binary.UTF16.String
and provide the functions:
encode :: String -> [Word8] decode :: [Word8] -> String
? And then submit that as a patch to Eric, the utf8 maintainer.
-- Don
There is a UTF16 en/decoder in Foreign.C.String (see cWcharsToChars & charsToCWchars): http://darcs.haskell.org/libraries/base/Foreign/C/String.hs but it only seems to be available for Windows users, via the CWSTring functions. In Takusen we also have a UTF8 module (it's about the fourth or fifth out there, after HXML and John Meacham's, and someone else's - Graham Klyne?, and one hidden away in GHC's internals). It has pure en/decode functions (String <-> [Word8]), naturally (which we ripped off from John Meacham), but we were more interested in efficient marshalling from CStrings (or data buffers, if you like), so we wrote specific code to marshall CString -> String fairly quickly, and space efficiently (see fromUTF8Ptr, which is wrapped by peekUTF8String{Len}): http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs We stuck it in the Foreign.C namespace, rather than Codec, because we're doing more FFI related stuff. I'm not sure what the best location is; perhaps there should be a split, with FFI functions (withUTF8String, peekUTF8String) in Foreign.C, and pure functions in Codec. (Also, is there a wiki page somewhere which gives advice as to how to locate/name library modules, and what the currently occupied namespace is, including user libs like those on Hackage? It's sometimes a bit tricky to try to figure out where to put a new module.) Obviously a proliferation of UTF8 modules isn't great for code re-use. Is there a plan to consolidate and expose UTF8 and UTF16 de- and encoders in the libraries? I note that the various UTF8 modules have fairly similar implementations, and differ mainly w.r.t. how much of the UTF8 codepoint space they handle (for example, HXML's decodes up to 6 bytes, which isn't strictly standards compliant). Also, some choice as how to handle errors in the byte stream might be nice i.e. the user could choose between functions which raise errors, or introduce substition chars. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Hello Alistair, Thursday, July 26, 2007, 12:29:06 PM, you wrote:
Obviously a proliferation of UTF8 modules isn't great for code re-use. Is there a plan to consolidate and expose UTF8 and UTF16 de- and encoders in the libraries?
afair there is utf-string module, which provides utf-8 functionality. may be we should just add utf-16 support there? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

From: Bulat Ziganshin [mailto:bulat.ziganshin@gmail.com]
Obviously a proliferation of UTF8 modules isn't great for code re-use. Is there a plan to consolidate and expose UTF8 and UTF16 de- and encoders in the libraries?
afair there is utf-string module, which provides utf-8 functionality. may be we should just add utf-16 support there?
Probably. Is Codec.Binary.UTF8.String the best namespace for it? (why Binary?) Where should FFI functions that marshal to-and-from UTF8 encoded CStrings sit? Also, the UTF8 decoder decodes up to 6 bytes per char, which AFAIUI is not standards compliant. It'd be nice if these ended up somewhere in the base libs, or (if we're looking to reduce the size of base) in the standard distributions, so that (as a UTF8 library user) I don't feel compelled to either: - add another dependency to my project, or - copy the code into my project, or - implement my own de/encoder to fill a gap in this one Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Hello Donald, Thursday, July 26, 2007, 8:13:37 AM, you wrote:
I don't know if anybody cares, but... Today a wrote some trivial code to decode (not encode) UTF-16.
These functions already exist in win-specific part of base: cWcharsToChars :: [CWchar] -> [Char] charsToCWchars :: [Char] -> [CWchar] #ifdef mingw32_HOST_OS -- On Windows, wchar_t is 16 bits wide and CWString uses the UTF-16 encoding. -- coding errors generate Chars in the surrogate range cWcharsToChars = map chr . fromUTF16 . map fromIntegral where fromUTF16 (c1:c2:wcs) | 0xd800 <= c1 && c1 <= 0xdbff && 0xdc00 <= c2 && c2 <= 0xdfff = ((c1 - 0xd800)*0x400 + (c2 - 0xdc00) + 0x10000) : fromUTF16 wcs fromUTF16 (c:wcs) = c : fromUTF16 wcs fromUTF16 [] = [] charsToCWchars = foldr utf16Char [] . map ord where utf16Char c wcs | c < 0x10000 = fromIntegral c : wcs | otherwise = let c' = c - 0x10000 in fromIntegral (c' `div` 0x400 + 0xd800) : fromIntegral (c' `mod` 0x400 + 0xdc00) : wcs -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

"Andrew" == Andrew Coppin
writes:
Andrew> I believe somebody out there has a UTF-8 decoder, but I Andrew> needed UTF-16 as it happens. (I didn't bother decoding Andrew> code points outside the BMP - I'm sure you can figure out Andrew> why.) Well I can't. And if you don't, then its not a UTF-16 decoder. The whole point of UTF-16 is to blight everybodies lives by introducing a gap in the codepoint space. -- Colin Adams Preston Lancashire
participants (5)
-
Andrew Coppin
-
Bayley, Alistair
-
Bulat Ziganshin
-
Colin Paul Adams
-
dons@cse.unsw.edu.au