
Hi,

On Tuesday, 23.03.2010, at 08:51 -0700, John Millikin wrote:
On Tue, Mar 23, 2010 at 00:27, Johann Höchtl wrote:
How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happen to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type system happy?
There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is just a more efficient encoding of [Word8], just as Text is a more efficient encoding of [Char]. If the file format you're parsing specifies that some series of bytes is text encoded as UTF16-LE, then you can use the Text decoders to convert to Text.
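To make the decoding step concrete, here is a minimal sketch, assuming the `text` and `bytestring` packages. The names `sampleBytes`, `decoded` and `asUtf8` are only illustrative. Note that `decodeUtf16LE` throws an exception on invalid input; `decodeUtf16LEWith` together with a handler from Data.Text.Encoding.Error is the variant to use for untrusted socket data.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16LE, encodeUtf8)

-- Illustrative input: the two characters "Hi" encoded as UTF-16LE,
-- as they might arrive over a socket.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0x48, 0x00, 0x69, 0x00]

-- Raw bytes become Text only once we state which encoding they are in.
decoded :: T.Text
decoded = decodeUtf16LE sampleBytes

-- Going back to bytes is again an explicit choice of encoding,
-- here UTF-8, e.g. for a C library that expects UTF-8 strings.
asUtf8 :: B.ByteString
asUtf8 = encodeUtf8 decoded
```

Every crossing between ByteString and Text names an encoding explicitly, which is exactly the transcoding cost discussed in this thread.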
It would still be useful to have an alternative to Data.Text that internally stores strings as UTF8-encoded bytestrings. I tried to switch from String to Data.Text in arbtt (which mostly calls pcre-light, which expects and returns UTF8-encoded C strings), and it became slower! No surprise, considering that the program has to re-encode the strings all the time. Using a

    newtype Text = Text ByteString

with an interface akin to Data.Text, but with UTF8-encoded ByteStrings internally, gave the same performance as String at half the memory footprint. This is in an internal module¹, but I would find it handy to have this available as a common type in a well-supported library.
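For readers who do not want to follow the link, the idea can be sketched roughly as follows. This is not the code from the module in the footnote; `Utf8Text`, `packUtf8`, `unpackUtf8`, `toByteString` and `lengthChars` are hypothetical names, and the sketch assumes the `text` and `bytestring` packages for the conversions.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical wrapper: the payload is always a UTF-8 encoded ByteString.
newtype Utf8Text = Utf8Text B.ByteString
  deriving (Eq, Ord)

-- Encode once when the value enters the program.
packUtf8 :: String -> Utf8Text
packUtf8 = Utf8Text . TE.encodeUtf8 . T.pack

-- Decode only when a String is really needed.
unpackUtf8 :: Utf8Text -> String
unpackUtf8 (Utf8Text bs) = T.unpack (TE.decodeUtf8 bs)

-- Free: the bytes can be handed to a UTF-8 based C binding
-- without any re-encoding.
toByteString :: Utf8Text -> B.ByteString
toByteString (Utf8Text bs) = bs

-- Number of code points: count every byte that is not a UTF-8
-- continuation byte (bit pattern 10xxxxxx).
lengthChars :: Utf8Text -> Int
lengthChars (Utf8Text bs) = B.foldl' step 0 bs
  where
    step n w
      | w .&. 0xC0 == 0x80 = n
      | otherwise          = n + 1
```

The trade-off is that character-indexed operations such as `lengthChars` have to scan the bytes, while passing the value to and from a UTF-8 based library costs nothing.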
Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

-- 
Joachim "nomeata" Breitner
  mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C
  JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
  Debian Developer: nomeata@debian.org