Re: [Haskell-cafe] Bytestrings and [Char]

23 Mar 2010

Hi,

On Tuesday, 23.03.2010, at 08:51 -0700, John Millikin wrote:
> On Tue, Mar 23, 2010 at 00:27, Johann Höchtl wrote:
> > How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I
> > read bytestrings over a socket which happen to be UTF16-LE encoded and
> > identify a fitting function in Data.Text, I guess I have to transcode them
> > with Data.Text.Encoding to make the type system happy?
>
> There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is
> just a more efficient encoding of [Word8], just as Text is a more
> efficient encoding of [Char]. If the file format you're parsing
> specifies that some series of bytes is text encoded as UTF16-LE, then
> you can use the Text decoders to convert to Text.
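
To make the decoding step concrete, here is a minimal sketch, assuming the text and bytestring packages. The names decodeWire and toUtf8 are made up for illustration; lenient decoding (invalid input becomes U+FFFD) is just one possible error policy.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Hypothetical helper: decode bytes that the protocol declares to be
-- UTF-16LE. Invalid sequences are replaced with U+FFFD instead of
-- raising an exception.
decodeWire :: B.ByteString -> T.Text
decodeWire = TE.decodeUtf16LEWith TEE.lenientDecode

-- Hypothetical helper: re-encode as UTF-8, e.g. before handing the
-- string to a C library that expects UTF-8.
toUtf8 :: T.Text -> B.ByteString
toUtf8 = TE.encodeUtf8
```

In GHCi, decodeWire (B.pack [0x48, 0x00, 0x69, 0x00]) should yield the Text "Hi". Every such round trip costs a copy and a transcode.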
It would still be useful to have an alternative to Data.Text that
internally stores strings as UTF8-encoded bytestrings. I tried to switch
from String to Data.Text in arbtt (which mostly calls pcre-light, which
expects and returns UTF8-encoded C-strings), and it became slower! No
surprise, considering that the program has to re-encode the strings all
the time.

Using a

    newtype Text = Text ByteString

with an interface akin to Data.Text, but using UTF8-encoded ByteStrings
internally, gave the same performance as String at half the memory
footprint. This is in an internal module¹, but I would find it handy to
have this available as a common type in a well-supported library.
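
For illustration, a minimal sketch of such a wrapper, assuming only the text and bytestring packages. This is not the contents of MyText.hs; the names Utf8Text, toUtf8Bytes, fromUtf8Bytes, pack, unpack and append are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Hypothetical wrapper: text stored as UTF-8 bytes. Byte-wise Eq/Ord
-- on valid UTF-8 agrees with code point order, so deriving is enough.
newtype Utf8Text = Utf8Text { toUtf8Bytes :: B.ByteString }
  deriving (Eq, Ord)

-- Wrap bytes that are already UTF-8 (e.g. returned by a C library).
-- No validation and no copy; the caller vouches for the encoding.
fromUtf8Bytes :: B.ByteString -> Utf8Text
fromUtf8Bytes = Utf8Text

-- Conversions from and to String, going through Data.Text only at
-- the boundary.
pack :: String -> Utf8Text
pack = Utf8Text . TE.encodeUtf8 . T.pack

unpack :: Utf8Text -> String
unpack = T.unpack . TE.decodeUtf8With TEE.lenientDecode . toUtf8Bytes

-- Concatenating two valid UTF-8 strings gives valid UTF-8.
append :: Utf8Text -> Utf8Text -> Utf8Text
append (Utf8Text a) (Utf8Text b) = Utf8Text (B.append a b)
```

With this layout, toUtf8Bytes and fromUtf8Bytes are free, so passing strings to and from a UTF8-based C binding needs no re-encoding. Operations that index by character would have to walk the bytes.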
Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

-- 
Joachim "nomeata" Breitner
  mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C
  JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
  Debian Developer: nomeata@debian.org
