
On Tue, Mar 23, 2010 at 08:51:16AM -0700, John Millikin wrote:
On Tue, Mar 23, 2010 at 00:27, Johann Höchtl
wrote: How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happens to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type System happy?
There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is just a more efficient encoding of [Word8], just as Text is a more efficient encoding of [Char]. If the file format you're parsing specifies that some series of bytes is text encoded as UTF16-LE, then you can use the Text decoders to convert to Text.
Poor separation between bytes and characters has caused problems in many major languages (C, C++, PHP, Ruby, Python) -- lets not abandon the advantages of correctness to chase a few percentage points of performance.
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points… I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it hell slow. Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1. Data.ByteString.Lazy (bytestring readFile + length) - < 10 miliseconds, incorrect length (as expected). Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length. Data.Text.Lazy (system readFile + pack + length) - 26s, correct length. String (system readfile + length) - ~1 second, correct length. For the record: python2.6 (str type) - ~60ms, incorrect length. python3.1 (unicode) - ~310ms, correct length. If anyone has a solution on how to work on fast text (unicode) transformations (but not a 1:1 pipeline where fusion can work nicely), I'd be glad to hear. iustin