
On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
On 18:25 Tue 23 Mar, Iustin Pop wrote:
On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote:
On 18:11 Tue 23 Mar, Iustin Pop wrote:
I agree with the principle of correctness, but let's be honest - the gap between ByteString on one side and String or Text on the other is (many) orders of magnitude, not just a few percentage points…
I've been struggling with this problem too and it's not nice. Every time one uses the system readFile & friends (anything that doesn't read via ByteStrings), it's hellishly slow.
Test: read a file and compute its length in chars. The input text file is ~40MB in size and has one non-ASCII char. The test might seem stupid, but it is a simple one. GHC 6.12.1. (A rough sketch of the code for each variant follows the results below.)
Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds, incorrect length (as expected).
Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 seconds, correct length.
Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
String (system readFile + length) - ~1 second, correct length.
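
A minimal sketch of what the four measurements could have looked like (the exact code wasn't posted; the module choices and command-line handling here are my assumptions, not the original benchmark):

-- Sketch, not the original benchmark code.
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BLU   -- utf8-string package
import qualified Data.Text.Lazy as TL
import System.Environment (getArgs)

main :: IO ()
main = do
  [variant, path] <- getArgs
  case variant of
    "bytestring" -> print . BL.length =<< BL.readFile path   -- counts bytes, hence the "incorrect" length
    "utf8"       -> print . BLU.length . BLU.fromString =<< readFile path
    "text"       -> print . TL.length . TL.pack =<< readFile path
    "string"     -> print . length =<< readFile path
    _            -> error "usage: bench {bytestring|utf8|text|string} FILE"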
Is this a mistake? Your own report shows String & readFile being an order of magnitude faster than everything else, contrary to your earlier claim.
No, it's not a mistake. String is faster than packing to Text and taking the length, but it's 100 times slower than ByteString.
Only if you don't care about obtaining the correct answer, in which case you may as well just say const 42 or somesuch, which is even faster.
My whole point is that the difference between byte processing and char processing in Haskell is not a few percentage points, but an order of magnitude. I would really like to have only the 6x penalty that Python shows, for example.
Hang on a second... less than 10 milliseconds to read 40 megabytes from disk? Something's fishy here.
Of course I don't want to benchmark the disk, and therefore the source file is on tmpfs.
I ran my own tests with a 400M file (419430400 bytes) consisting almost exclusively of the letter 'a' with two Japanese characters placed at every multiple of 40 megabytes (UTF-8 encoded).
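
For reference, a generator for an input file like that might look as follows (a sketch under my own assumptions about the layout: ten 40 MiB chunks of 'a', each ending in two Japanese characters, i.e. 6 bytes of UTF-8, for a total of 419430400 bytes):

-- Sketch of a test-file generator, not Nick's actual one.
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.ByteString.Lazy.UTF8 as BLU

main :: IO ()
main = BL.writeFile "test.txt" (BL.concat (replicate 10 chunk))
  where
    chunk = BLC.replicate (40 * 1024 * 1024 - 6) 'a'
            `BL.append` BLU.fromString "日本"   -- two 3-byte UTF-8 characters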
With Prelude.readFile/length and 5 runs, I see
10145ms, 10087ms, 10223ms, 10321ms, 10216ms.
with approximately 10% of that time spent performing GC each run.
With Data.ByteString.Lazy.readFile/length and 5 runs, I see
8223ms, 8192ms, 8077ms, 8091ms, 8174ms.
with approximately 20% of that time spent performing GC each run. Maybe there are some magic command-line options to tune the GC for our purposes, but I only managed to make things slower. Thus, I'll handwave a bit and just shave the GC time off each result.
Prelude: 9178ms mean with a standard deviation of 159ms. Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms.
Therefore, we managed a throughput of 43 MB/s with the Prelude (and got the right answer), while we managed 61 MB/s with lazy ByteStrings (and got the wrong answer). My disk won't go much, if at all, faster than the second result, so that's good.
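
(For anyone wanting to reproduce this, a minimal wall-clock harness along these lines would do. This is my own sketch, not the code Nick ran; the GC share would come from the RTS statistics, e.g. running the compiled binary with +RTS -s, not from the program itself.)

-- Minimal timing sketch; the file name "test.txt" is assumed.
import qualified Data.ByteString.Lazy as BL
import Data.Time.Clock (diffUTCTime, getCurrentTime)

timeIt :: String -> IO Int -> IO ()
timeIt label act = do
  start <- getCurrentTime
  n     <- act
  end   <- n `seq` getCurrentTime
  putStrLn $ label ++ ": length " ++ show n ++ " in " ++ show (diffUTCTime end start)

main :: IO ()
main = do
  timeIt "Prelude"    (length `fmap` readFile "test.txt")
  timeIt "ByteString" ((fromIntegral . BL.length) `fmap` BL.readFile "test.txt")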
I'll bet that for a 400MB file, if you have more than 2GB of RAM, most of it will be cached. If you want to check Haskell performance, just copy it to a tmpfs filesystem so that the disk is out of the equation.
So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude.
That's because you're possibly benchmarking the disk as well. With a 400MB file on tmpfs, lazy bytestring readFile + length takes ~150ms on my machine, which is way faster than 8 seconds…
On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC.
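
(Again, rough sketches of what those two measurements presumably looked like; this is my reconstruction, not the actual code:)

-- Reconstruction of the two character-count variants above.
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BLU
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

utf8Length :: FilePath -> IO Int
utf8Length path = BLU.length `fmap` BL.readFile path

textLength :: FilePath -> IO Int64
textLength path = TL.length `fmap` TLIO.readFile path

main :: IO ()
main = do
  print =<< utf8Length "test.txt"
  print =<< textLength "test.txt"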
GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With Perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as the Prelude. In my first ever Python program (with Python 2.6), I tried to read the entire file as a unicode string and it quickly crashed due to running out of memory (yikes!), so it earns a DNF.
So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap.
Totally agreed :)

iustin