
What is the state of UTF8 support in Haskell libraries (base or user-contributed)? I had a need for a UTF8 en & de-coder for Takusen, and after looking around couldn't find anything particularly satisfactory, so ended up writing (yet another) one.

I'm interested mainly in marshalling to/from CStrings, so support for functions like peekUTF8String, newUTF8String, withUTF8String, etc is interesting. I realise that one can use one of the pure decoders after a peekCString, but that means building an intermediate list, which isn't strictly necessary.

So far I've found the following:

- John Meacham's UTF8 lib: http://repetae.net/repos/jhc/UTF8.hs (only handles codepoints < 65536, pure String <-> [Word8] so no direct CString marshalling)
- HXT's Text.XML.HXT.DOM.Unicode: http://www.fh-wedel.de/~si/HXmlToolbox/ (full Unicode range - up to 6 bytes per char, pure String <-> String)
- George Russell's: http://www.haskell.org/pipermail/glasgow-haskell-users/2004-April/006564.htm... (buggy - won't roundtrip chars > 127, pure String <-> String)

The one I wrote, which is largely based on John Meacham's and HXT's code, can be seen here: http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs

Alistair
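For illustration, direct marshalling of the kind described above (decoding bytes straight from the C buffer, with no intermediate [Word8] or Latin-1 String) might look like the sketch below. This is not the Takusen code: the name peekUTF8String is borrowed from the message, but the body, the helper names (go, step, multi, cont, valid, bad, demo) and the choice to replace malformed input with U+FFFD are all assumptions made for the example. It restricts itself to the 4-byte range (up to U+10FFFF) rather than the older 6-byte forms.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (Ptr, castPtr, plusPtr)
import Foreign.Storable (peek)

-- | Decode a NUL-terminated UTF-8 C string directly into a String.
-- Illustrative sketch only: malformed sequences become U+FFFD.
peekUTF8String :: CString -> IO String
peekUTF8String cs = go (castPtr cs)
  where
    go :: Ptr Word8 -> IO String
    go p = peek p >>= step p

    -- Dispatch on the lead byte.
    step :: Ptr Word8 -> Word8 -> IO String
    step p b
      | b == 0              = return []   -- NUL terminator: end of string
      | b < 0x80            = (chr (fromIntegral b) :) <$> go (p `plusPtr` 1)
      | b >= 0xC2, b < 0xE0 = multi p 1 (fromIntegral b .&. 0x1F)
      | b >= 0xE0, b < 0xF0 = multi p 2 (fromIntegral b .&. 0x0F)
      | b >= 0xF0, b < 0xF5 = multi p 3 (fromIntegral b .&. 0x07)
      | otherwise           = bad p       -- stray continuation or invalid lead

    -- Read n continuation bytes after the lead byte at p.
    multi :: Ptr Word8 -> Int -> Int -> IO String
    multi p n lead = do
      r <- cont (p `plusPtr` 1) n lead
      case r of
        Just (cp, p') | valid n cp -> (chr cp :) <$> go p'
        _                          -> bad p

    -- Accumulate 6 payload bits per continuation byte. A NUL byte fails
    -- the 10xxxxxx check, so this never reads past the terminator.
    cont :: Ptr Word8 -> Int -> Int -> IO (Maybe (Int, Ptr Word8))
    cont q 0 acc = return (Just (acc, q))
    cont q k acc = do
      b <- peek q
      if b .&. 0xC0 == 0x80
        then cont (q `plusPtr` 1) (k - 1)
                  ((acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F))
        else return Nothing

    -- Reject overlong forms, surrogates, and values above U+10FFFF.
    valid :: Int -> Int -> Bool
    valid 1 cp = cp >= 0x80
    valid 2 cp = cp >= 0x800 && (cp < 0xD800 || cp > 0xDFFF)
    valid _ cp = cp >= 0x10000 && cp <= 0x10FFFF

    -- Emit a replacement character and resynchronise one byte later.
    bad :: Ptr Word8 -> IO String
    bad p = ('\xFFFD' :) <$> go (p `plusPtr` 1)

-- Hypothetical usage: the bytes are the UTF-8 encoding of "h", U+00E9,
-- U+20AC and U+1D11E, placed in a temporary NUL-terminated buffer.
demo :: IO String
demo = withArray0 0 bytes (peekUTF8String . castPtr)
  where
    bytes :: [Word8]
    bytes = [0x68, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9D, 0x84, 0x9E]
```

The recursion is not tail-recursive, so very long strings use stack proportional to their length; a production version would probably decode in chunks or into a packed buffer instead.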

Hi Alistair,
On Fri, 02 Feb 2007 21:01:04 +0900, Alistair Bayley wrote:
What is the state of UTF8 support in Haskell libraries (base or user-contributed)? I had a need for a UTF8 en & de-coder for Takusen, and after looking around couldn't find anything particularly satisfactory, so ended up writing (yet another) one.
regex-posix doesn't support UTF8, because it uses POSIX regex. So this problem can't be fixed by a correct UTF8 en & de-coder alone.

If someone is interested in supporting UTF8, I recommend using Oniguruma.
http://www.geocities.jp/kosako3/oniguruma/

Oniguruma also supports UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, etc. It is also portable: it's available on both Unix and Windows. So I think it is the best regex C library to choose as a backend.

--
shelarcy <shelarcy capella.freemail.ne.jp>
http://page.freett.com/shelarcy/

From: libraries-bounces@haskell.org
If someone is interested in supporting UTF8, I recommend using Oniguruma.
http://www.geocities.jp/kosako3/oniguruma/
Oniguruma also supports UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, etc. It is also portable: it's available on both Unix and Windows.
So I think it is the best regex C library to choose as a backend.
Sorry, I didn't explain this so well. I meant a decoder to marshal a C string that I know is UTF8 into a Haskell String (i.e. [Char]). An FFI call out to C might be convenient, but will have overhead. It's not that hard to write a UTF8 decoder (and encoder) in Haskell; I just wanted to avoid wasted work.

Alistair
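As a rough indication of how little code the encoding direction needs in pure Haskell, here is a minimal sketch. It is not taken from any of the libraries mentioned in this thread; withUTF8String reuses a name from the original message, while encodeChar and encodeUTF8 are hypothetical helpers. It assumes the input contains no '\0' characters (which would truncate the C string) and no surrogate code points (which it would encode without complaint).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

-- | Encode one code point as 1 to 4 UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c

    -- Payload bits for the lead byte (already small enough to fit).
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)

    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- | Encode a whole String.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar

-- | Run an action with a temporary NUL-terminated UTF-8 copy of the String.
-- The buffer is freed when the action returns, so the pointer must not escape.
withUTF8String :: String -> (CString -> IO a) -> IO a
withUTF8String s act = withArray0 0 (encodeUTF8 s) (act . castPtr)
```

Note that this version does go through an intermediate [Word8], because withArray0 needs the list length before it allocates. Avoiding that would mean computing the encoded length first and then poking bytes directly into the buffer.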

On Sat, Feb 03, 2007 at 12:40:14AM +0900, shelarcy wrote:
Hi Alistair,
On Fri, 02 Feb 2007 21:01:04 +0900, Alistair Bayley wrote:
What is the state of UTF8 support in Haskell libraries (base or user-contributed)? I had a need for a UTF8 en & de-coder for Takusen, and after looking around couldn't find anything particularly satisfactory, so ended up writing (yet another) one.
regex-posix doesn't support UTF8, because it uses POSIX regex. So this problem can't be fixed by a correct UTF8 en & de-coder alone.
Actually, conforming POSIX regular expressions work in the character encoding of the current locale, which is very likely utf8. In fact, personally, I would consider any system where it is still not utf8 (or perhaps ascii, which is compatible) in this day and age to be broken (not that they don't exist).

John

--
John Meacham - ⑆repetae.net⑆john⑈
participants (4)
- Alistair Bayley
- Bayley, Alistair
- John Meacham
- shelarcy