
Hello, I was recently playing with Haskell (GHC, that is) IO and text processing. ByteStrings and lazy ByteStrings allow for fast and memory-efficient string (well, bytestring) handling, yet a lot of libraries do not support them (yet). Given the incredibly inefficient memory representation of [Char] (16 bytes? per cell), I wonder whether String should default to lazy ByteString altogether instead of [Char]. The Levenshtein distance as it is on Hackage uses e.g. String ([Char]) and as such is unnecessarily slow. My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings? Regards, Johann

Johann Höchtl
ByteStrings and lazy ByteStrings allow for fast and memory-efficient string (well, bytestring) handling, yet a lot of libraries do not support them (yet).
What do you mean? A lot of libraries need to use String because it's easier to deal with and doesn't rely on knowing which encoding is being used. ByteString is often nicer for IO, but for internal library stuff (e.g. parsing) it may still be easier to use String.
Given the incredibly inefficient memory representation of [Char] (16 bytes? per cell), I wonder whether String should default to lazy ByteString altogether instead of [Char].
Well, then all pattern matching, etc. fails. String and ByteStrings have their own separate places and uses.
The Levenshtein distance as it is on Hackage uses e.g. String ([Char]) and as such is unnecessarily slow.
Have you ported a version to ByteString and demonstrated that it is noticeably faster? Are you sure that it isn't the fault of the algorithm being used? Do you know that people only care about the Levenshtein distance of ByteStrings and not of normal Strings? (Disclaimer: I've never even looked at the package.) Providing a ByteString version as well might be a viable option; replacing the current one is quite likely not.
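For illustration, here is a rough sketch of what such a ByteString-specialised version could look like. This is a naive O(n*m) dynamic-programming Levenshtein distance written for this thread, not the algorithm used by the Hackage package, and the name levenshteinBS is made up:

    import qualified Data.ByteString as B
    import Data.List (foldl')

    -- Naive DP Levenshtein distance over strict ByteStrings.
    -- A serious version would avoid the intermediate Word8 lists.
    levenshteinBS :: B.ByteString -> B.ByteString -> Int
    levenshteinBS a b = last (foldl' step [0 .. B.length a] (B.unpack b))
      where
        step row@(d:ds) c = scanl compute (d + 1) (zip3 (B.unpack a) row ds)
          where
            compute left (ca, diag, up) =
              minimum [left + 1, up + 1, diag + if ca == c then 0 else 1]
        step [] _ = error "unreachable: rows are never empty"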
My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings?
I believe I've answered this above. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Mon, Mar 22, 2010 at 1:16 PM, Johann Höchtl
My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings?
A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text. -- Johan
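As a small illustration of that distinction (my own example, not from the original mail, assuming the text and bytestring packages): the same text has different byte and character counts once you leave ASCII.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let t = T.pack "Höchtl"
      print (T.length t)                  -- 6 characters (code points)
      print (B.length (TE.encodeUtf8 t))  -- 7 bytes: 'ö' needs two bytes in UTF-8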

On 23 March 2010 00:10, Johan Tibell
A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Though Data.Text still has the disadvantage of not being as nice to deal with as String, since you can't pattern match on it, etc. Whilst it may degrade performance, treating String as a list of characters rather than an array provides you with greater flexibility of how to deal with it. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Turn on OverloadedStrings and you can pattern match on any type you
like that is in the IsString class.
Which means that Data.Text can use string literals just like regular
strings (but you can't use Char literals in the match).
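A minimal sketch of what Lennart describes (my own example, not from the thread): with OverloadedStrings, a Text value can be matched against string literals, because the literal is desugared via fromString and compared with (==).

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)

    -- String-literal patterns work for Text thanks to IsString + Eq.
    greet :: Text -> Text
    greet "hello" = "hi there"
    greet _       = "who are you?"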
On Mon, Mar 22, 2010 at 1:15 PM, Ivan Miljenovic
On 23 March 2010 00:10, Johan Tibell
wrote: A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Though Data.Text still has the disadvantage of not being as nice to deal with as String, since you can't pattern match on it, etc.
Whilst it may degrade performance, treating String as a list of characters rather than an array provides you with greater flexibility of how to deal with it.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 22.03.2010 14:15, Ivan Miljenovic wrote:
On 23 March 2010 00:10, Johan Tibell
wrote: A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Though Data.Text still has the disadvantage of not being as nice to deal with as String, since you can't pattern match on it, etc.
Whilst it may degrade performance, treating String as a list of characters rather than an array provides you with greater flexibility of how to deal with it.
But doesn't this basically mean sacrificing performance and applicability for algorithmic beauty? Sure, pattern matching comes in handy, but in the case of strings it comes at a high price.

Ivan Miljenovic wrote:
Johan Tibell wrote:
A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Though Data.Text still has the disadvantage of not being as nice to deal with as String, since you can't pattern match on it, etc.
Whilst it may degrade performance, treating String as a list of characters rather than an array provides you with greater flexibility of how to deal with it.
Indeed. In particular, [Char] and Data.Text.Text have different semantics and performance. For instance, cons :: Char -> Text -> Text is O(n) whereas (:) is O(1). (Not sure whether stream fusion for cons can change that at times.) Furthermore, certain programs relying on laziness, like

    ahh  = 'a' : ahh
    fibs = ' ' : zipWith plus fibs (tail fibs)
        where plus x y = toEnum $ fromEnum x + fromEnum y

are not possible with Data.Text.Text. (Whether you really need these is another question, of course.) Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Mon, Mar 22, 2010 at 6:10 AM, Johan Tibell
On Mon, Mar 22, 2010 at 1:16 PM, Johann Höchtl
wrote: My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings?
A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Slight correction. A sequence of bytes is exactly the same thing as a sequence of Unicode bytes when you use UTF8.

David Leimbach wrote:
On Mon, Mar 22, 2010 at 6:10 AM, Johan Tibell
wrote: On Mon, Mar 22, 2010 at 1:16 PM, Johann Höchtl
wrote: My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings? A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Slight correction.
A sequence of bytes is exactly the same thing as a sequence of Unicode bytes when you use UTF8.
What is a "Unicode byte"? Cheers, Jochem -- Jochem Berndsen | jochem@functor.nl

Hi David Leimbach wrote:
On Mon, Mar 22, 2010 at 6:10 AM, Johan Tibell
wrote: On Mon, Mar 22, 2010 at 1:16 PM, Johann Höchtl wrote: My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings? A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
Slight correction.
A sequence of bytes is exactly the same thing as a sequence of Unicode bytes when you use UTF8.
And a sequence of bytes is exactly the same thing as a UTF-32 string when the sequence is encoded as UTF-32. But that is not the point. The point is that when some function is handed a ByteString, the ByteString can be encoded in many ways. It does not even have to be text. So you need a UTF-8 String type. /Mads

On 22.03.2010 14:10, Johan Tibell wrote:
On Mon, Mar 22, 2010 at 1:16 PM, Johann Höchtl
wrote: My question or discussion point: Why not deprecate [Char] altogether in favour of lazy ByteStrings?
A sequence of bytes is not the same thing as a sequence of Unicode code points. If you want to replace String by something more efficient have a look at Data.Text.
How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happen to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type system happy?

On Tue, Mar 23, 2010 at 00:27, Johann Höchtl
How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happen to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type system happy?
There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is just a more efficient encoding of [Word8], just as Text is a more efficient encoding of [Char]. If the file format you're parsing specifies that some series of bytes is text encoded as UTF16-LE, then you can use the Text decoders to convert to Text. Poor separation between bytes and characters has caused problems in many major languages (C, C++, PHP, Ruby, Python) -- let's not abandon the advantages of correctness to chase a few percentage points of performance.
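As a hedged sketch of the decode step John describes (my example, assuming the text package's Data.Text.Encoding; the socket-reading part is elided):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Interpret raw bytes received over the wire as UTF-16LE text.
    -- decodeUtf16LE throws an exception on malformed input.
    bytesToText :: B.ByteString -> T.Text
    bytesToText = TE.decodeUtf16LE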

Hi, on Tuesday, 23 March 2010 at 08:51 -0700, John Millikin wrote:
On Tue, Mar 23, 2010 at 00:27, Johann Höchtl
wrote: How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happen to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type system happy?
There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is just a more efficient encoding of [Word8], just as Text is a more efficient encoding of [Char]. If the file format you're parsing specifies that some series of bytes is text encoded as UTF16-LE, then you can use the Text decoders to convert to Text.
It would still be useful to have an alternative to Data.Text that internally stores strings as UTF8-encoded bytestrings. I tried to switch from String to Data.Text in arbtt (which mostly calls pcre-light, which expects and returns UTF8-encoded C-strings), and it became slower! No surprise, considering that the program has to re-encode the strings all the time. Using a
newtype Text = Text ByteString
with an interface akin to Data.Text, but using UTF8-encoded ByteStrings internally, gave the same performance as String at half the memory footprint. This is in an internal module¹ but I would find it handy to have this available as a common type in a well-supported library.
Greetings, Joachim ¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs -- Joachim "nomeata" Breitner mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/ Debian Developer: nomeata@debian.org
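A rough sketch of that idea under stated assumptions (my own illustration, not the contents of Joachim's Data.MyText module; the name MyText and the handful of operations are made up, and the utf8-string package is assumed for the UTF-8-aware length):

    import Prelude hiding (length)
    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.UTF8 as U

    -- A Text-like type that keeps its payload as a UTF-8 encoded ByteString,
    -- so data coming from or going to UTF-8 C libraries needs no re-encoding.
    newtype MyText = MyText L.ByteString

    fromString :: String -> MyText
    fromString = MyText . U.fromString

    toString :: MyText -> String
    toString (MyText b) = U.toString b

    -- Character count, decoding UTF-8 on the fly.
    length :: MyText -> Int
    length (MyText b) = U.length b

    -- Concatenation of the underlying lazy ByteStrings.
    append :: MyText -> MyText -> MyText
    append (MyText a) (MyText b) = MyText (L.append a b)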

On Tue, Mar 23, 2010 at 08:51:16AM -0700, John Millikin wrote:
On Tue, Mar 23, 2010 at 00:27, Johann Höchtl
wrote: How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I read bytestrings over a socket which happen to be UTF16-LE encoded and identify a fitting function in Data.Text, I guess I have to transcode them with Data.Text.Encoding to make the type system happy?
There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is just a more efficient encoding of [Word8], just as Text is a more efficient encoding of [Char]. If the file format you're parsing specifies that some series of bytes is text encoded as UTF16-LE, then you can use the Text decoders to convert to Text.
Poor separation between bytes and characters has caused problems in many major languages (C, C++, PHP, Ruby, Python) -- let's not abandon the advantages of correctness to chase a few percentage points of performance.
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points… I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it is hellishly slow.

Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1.

Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.

For the record: python2.6 (str type) - ~60ms, incorrect length. python3.1 (unicode) - ~310ms, correct length.

If anyone has a solution on how to work on fast text (unicode) transformations (but not a 1:1 pipeline where fusion can work nicely), I'd be glad to hear. iustin
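For concreteness, a sketch of roughly what the described test looks like (my reconstruction, not Iustin's actual code; the file name is made up, and note that later in the thread Bryan points out that the UTF8 and Text variants should use bytestring/text I/O rather than the system readFile):

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.UTF8 as U

    main :: IO ()
    main = do
      bytes <- L.readFile "input.txt"
      print (L.length bytes)   -- byte count: very fast, wrong for non-ASCII text
      print (U.length bytes)   -- UTF-8 character count: correct, but slower
      s <- readFile "input.txt"
      print (length s)         -- String character count via the system decoder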

On 18:11 Tue 23 Mar , Iustin Pop wrote:
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it is hellishly slow.
Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1.
Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.
Is this a mistake? Your own report shows String & readFile being an order of magnitude faster than everything else, contrary to your earlier claim. -- Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
On 18:11 Tue 23 Mar , Iustin Pop wrote:
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it is hellishly slow.
Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1.
Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.
Is this a mistake? Your own report shows String & readFile being an order of magnitude faster than everything else, contrary to your earlier claim.
No, it's not a mistake. String is faster than pack to Text and length, but it's 100 times slower than ByteString. My whole point is that the difference between byte processing and char processing in Haskell is not a few percentage points, but an order of magnitude. I would really like to have only the 6x penalty that Python shows, for example. regards, iustin

On 18:25 Tue 23 Mar , Iustin Pop wrote:
On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
On 18:11 Tue 23 Mar , Iustin Pop wrote:
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it is hellishly slow.
Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1.
Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.
Is this a mistake? Your own report shows String & readFile being an order of magnitude faster than everything else, contrary to your earlier claim.
No, it's not a mistake. String is faster than pack to Text and length, but it's 100 times slower than ByteString.
Only if you don't care about obtaining the correct answer, in which case you may as well just say const 42 or somesuch, which is even faster.
My whole point is that the difference between byte processing and char processing in Haskell is not a few percentage points, but an order of magnitude. I would really like to have only the 6x penalty that Python shows, for example.
Hang on a second... less than 10 milliseconds to read 40 megabytes from disk? Something's fishy here.

I ran my own tests with a 400M file (419430400 bytes) consisting almost exclusively of the letter 'a' with two Japanese characters placed at every multiple of 40 megabytes (UTF-8 encoded).

With Prelude.readFile/length and 5 runs, I see 10145ms, 10087ms, 10223ms, 10321ms, 10216ms, with approximately 10% of that time spent performing GC each run.

With Data.Bytestring.Lazy.readFile/length and 5 runs, I see 8223ms, 8192ms, 8077ms, 8091ms, 8174ms, with approximately 20% of that time spent performing GC each run. Maybe there's some magic command line options to tune the GC for our purposes, but I only managed to make things slower. Thus, I'll handwave a bit and just shave off the GC time from each result.

Prelude: 9178ms mean with a standard deviation of 159ms.
Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms.

Therefore, we managed a throughput of 43 MB/s with the Prelude (and got the right answer), while we managed 61 MB/s with lazy ByteStrings (and got the wrong answer). My disk won't go much, if at all, faster than the second result, so that's good. So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude.

On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC.

GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as Prelude. In my first ever Python program (with python 2.6), I tried to read the entire file as a unicode string and it quickly crashes due to running out of memory (yikes!), so it earns a DNF.

So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap. -- Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
On 18:25 Tue 23 Mar , Iustin Pop wrote:
On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
On 18:11 Tue 23 Mar , Iustin Pop wrote:
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it is hellishly slow.
Test: read a file and compute its size in chars. Input text file is ~40MB in size, has one non-ASCII char. The test might seem stupid but it is a simple one. ghc 6.12.1.
Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.
Is this a mistake? Your own report shows String & readFile being an order of magnitude faster than everything else, contrary to your earlier claim.
No, it's not a mistake. String is faster than pack to Text and length, but it's 100 times slower than ByteString.
Only if you don't care about obtaining the correct answer, in which case you may as well just say const 42 or somesuch, which is even faster.
My whole point is that the difference between byte processing and char processing in Haskell is not a few percentage points, but an order of magnitude. I would really like to have only the 6x penalty that Python shows, for example.
Hang on a second... less than 10 milliseconds to read 40 megabytes from disk? Something's fishy here.
Of course I don't want to benchmark the disk, and therefore the source file is on tmpfs.
I ran my own tests with a 400M file (419430400 bytes) consisting almost exclusively of the letter 'a' with two Japanese characters placed at every multiple of 40 megabytes (UTF-8 encoded).
With Prelude.readFile/length and 5 runs, I see
10145ms, 10087ms, 10223ms, 10321ms, 10216ms.
with approximately 10% of that time spent performing GC each run.
With Data.Bytestring.Lazy.readFile/length and 5 runs, I see
8223ms, 8192ms, 8077ms, 8091ms, 8174ms.
with approximately 20% of that time spent performing GC each run. Maybe there's some magic command line options to tune the GC for our purposes, but I only managed to make things slower. Thus, I'll handwave a bit and just shave off the GC time from each result.
Prelude: 9178ms mean with a standard deviation of 159ms. Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms.
Therefore, we managed a throughput of 43 MB/s with the Prelude (and got the right answer), while we managed 61 MB/s with lazy ByteStrings (and got the wrong answer). My disk won't go much, if at all, faster than the second result, so that's good.
I'll bet that for a 400MB file, if you have more than 2GB of RAM, most of it will be cached. If you want to check Haskell performance, just copy it to a tmpfs filesystem so that the disk is out of the equation.
So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude.
Because you're possibly benchmarking the disk also. With a 400MB file on tmpfs, lazy bytestring readfile + length takes on my machine ~150ms, which is way faster than 8 seconds…
On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC.
GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as Prelude. In my first ever Python program (with python 2.6), I tried to read the entire file as a unicode string and it quickly crashes due to running out of memory (yikes!), so it earns a DNF.
So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap.
Totally agreed :) iustin

Iustin Pop
On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude.
Because you're possibly benchmarking the disk also. With a 400MB file on tmpfs, lazy bytestring readfile + length takes on my machine ~150ms, which is way faster than 8 seconds…
If you read the source code, length does not read the data; that's why it is so fast. It cannot be done for UTF-8 strings.
From Data.ByteString.Lazy:
-- | /O(n\/c)/ 'length' returns the length of a ByteString as an 'Int64'
length :: ByteString -> Int64
length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs
{-# INLINE length #-}
On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC.
GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as Prelude. In my first ever Python program (with python 2.6), I tried to read the entire file as a unicode string and it quickly crashes due to running out of memory (yikes!), so it earns a DNF.
So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap.
Totally agreed :)
texitoi@flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt
Normal String + System.IO "60452700": 5.575169s
Data.ByteString.Lazy "61965200": 0.088136s
Data.ByteString.Lazy.UTF8 "60452700": 13.953714s
Cheating a little bit "60452700": 9.307322s
Data.Text.Lazy "60452700": 15.608354s
texitoi@flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt
1329900 8065200 61965200 /dev/shm/haskell-utf8.txt

real 0m9.303s
user 0m9.089s
sys 0m0.152s
texitoi@flyeeeng:~$

Hey, normal String is way faster than GNU wc! Cheat sheet, using Data.ByteString.Lazy:

myLength :: U.ByteString -> Int
myLength b = loop 0 b
  where loop n xs = case readChar xs of
          Just m  -> let n' = n + 1 in n' `seq` loop n' (L.drop m xs)
          Nothing -> n

readChar :: L.ByteString -> Maybe Int64
readChar bs = do
  (c, _) <- L.uncons bs
  return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c | c < 0xc0  = 1
             | c < 0xe0  = 2
             | c < 0xf0  = 3
             | c < 0xf8  = 4
             | otherwise = 1

inspired by Data.ByteString.Lazy.UTF8, same performance as GNU wc (it is cheating because it does not check the validity of the multibyte char). Using Debian testing, ghc 6.12.1 on an Atom N270 @ 1.6GHz. The file is a repeated LaTeX UTF8 file of about 60MB. -- Guillaume Pinot http://www.irccyn.ec-nantes.fr/~pinot/ "Grown-ups never understand anything by themselves, and it is tiresome for children to be always and forever explaining things to them..." -- Antoine de Saint-Exupéry, Le Petit Prince () ASCII ribbon campaign -- Against HTML e-mail /\ http://www.asciiribbon.org -- Against proprietary attachments

If you read the source code, length does not read the data; that's why it is so fast. It cannot be done for UTF-8 strings.
I think at this point most of the amazement is directed at Data.Text being slower than good old [Char] (at least for this operation - we should probably expand our view to more than one operation).
Hey, normal String is way faster than GNU wc!
No - you need to perform a fair comparison. Try "wc -c" to only count characters (not lines and words too). I'd provide numbers but my wc doesn't seem to support UTF-8 and I'm not sure what package contains a Unicode-aware wc.
readChar :: L.ByteString -> Maybe Int64
readChar bs = do
  (c, _) <- L.uncons bs
  return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c | c < 0xc0  = 1
             | c < 0xe0  = 2
             | c < 0xf0  = 3
             | c < 0xf8  = 4
             | otherwise = 1
inspired by Data.ByteString.Lazy.UTF8, same performance as GNU wc (it is cheating because it does not check the validity of the multibyte char).
Ah, interesting and a worthwhile cheat. Thomas

2010/3/23 Iustin Pop
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
Well, your benchmarks are highly suspect. See below.
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
You should be using bytestring I/O for this.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
You should be using text I/O for this.
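For reference, a hedged sketch of the corrected variants Bryan suggests (my example, assuming the text and utf8-string packages; the file name is made up):

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.UTF8 as U
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TIO

    main :: IO ()
    main = do
      bs <- L.readFile "input.txt"   -- bytestring I/O, no String round-trip
      print (U.length bs)
      t <- TIO.readFile "input.txt"  -- text I/O, decodes directly to Text
      print (T.length t)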

BOS:
Well, your benchmarks are highly suspect.
Attached is another benchmark with similar results. This is no criterion benchmark but I did try to adjust a wee bit for cache issues. Suffice to say I am not yet impressed with Data.Text performance-wise.

In the broader scope I feel there is a deeper point here: Data.ByteString is how most data is available in a program (be it from a file, network, or other library/device). With all this data we really should have some sort of zero-copy "safeCoerce" that can expose the bytestring as an O(1) unboxed array of some safe length. I know this would have been nice for pureMD5, which did use an ugly unsafePerformIO hack just to get Word16s, but even those got boxed up at horrible cost (it now uses 'cereal' to get the Word16 - even worse performance, but it lets me ignore endianness of the architecture).

---- OT ----
Beating the dead horse: I once wrote and deleted a blog post ranting about how to get a Word16 in C vs in Haskell:

word = ((uint16_t *)p)[index];

or

word = unsafePerformIO $ withForeignPtr ptr $ \ptr' ->
         let p = castPtr (plusPtr ptr' off) in peekElemOff p index

That Haskell snippet takes significantly longer to both write and run.
---- END OT ----

Code and benchmark numbers included below; all complaints about the accuracy of the benchmark are welcome. NOTE: tLog is a daily #haskell log file repeated many times ~ 59MB.

-------
[tommd@Mavlo Test]$ ./read tLog
Normal String + System.IO "61443120": 1.467924s
Data.ByteString.Lazy "61450365": 0.023285s
Data.ByteString.Lazy.UTF8 "61443120": 3.305154s
Data.Text.Lazy "61443120": 3.99178s

----- CODE -----
{-# LANGUAGE BangPatterns #-}
import Data.Time
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as U
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TO
import System.IO
import System.Environment (getArgs)
import Control.Monad (when)

main = do
  [file] <- getArgs
  let time (f, desc) = do
        s <- getCurrentTime
        r <- f
        let !r' = r
        t <- getCurrentTime
        let d = diffUTCTime t s
        when (length desc > 0) (putStrLn $ desc ++ " " ++ show r' ++ ": " ++ show d)
      ops = [ (readFile file >>= return . show . length, "")
            , (readFile file >>= return . show . length, "Normal String + System.IO")
            , (L.readFile file >>= return . show . L.length, "")
            , (L.readFile file >>= return . show . L.length, "Data.ByteString.Lazy")
            , (L.readFile file >>= return . show . U.length, "")
            , (L.readFile file >>= return . show . U.length, "Data.ByteString.Lazy.UTF8")
            , (TO.readFile file >>= return . show . T.length, "")
            , (TO.readFile file >>= return . show . T.length, "Data.Text.Lazy")
            ]
  mapM_ time ops

On Tue, Mar 23, 2010 at 11:22:23AM -0700, Bryan O'Sullivan wrote:
2010/3/23 Iustin Pop
I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
Well, your benchmarks are highly suspect. See below.
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
You should be using bytestring I/O for this.
As I was told, indeed :)
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
You should be using text I/O for this.
I didn't realize Data.Text.IO exists. Thanks for the hint! iustin

Bryan O'Sullivan wrote:
2010/3/23 Iustin Pop
wrote: I agree with the principle of correctness, but let's be honest - it's (many) orders of magnitude between ByteString and String and Text, not just a few percentage points…
Well, your benchmarks are highly suspect. See below.
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
You should be using bytestring I/O for this.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
You should be using text I/O for this. This is certainly true, but it adds some unfortunate bloat and has the taste of Java libraries, which define a plethora of similar but not interchangeable methods due to legacy.
participants (16):
- Bryan O'Sullivan
- David Leimbach
- Heinrich Apfelmus
- Iustin Pop
- Ivan Lazar Miljenovic
- Ivan Miljenovic
- Joachim Breitner
- Jochem Berndsen
- Johan Tibell
- Johann Höchtl
- John Millikin
- Lennart Augustsson
- Mads Lindstrøm
- Nick Bowler
- TeXitoi
- Thomas DuBuisson