[RFC] benchmarks of bytestrings, teaser

What do you think the relative speeds are of the six small haskell programs at the end of this email?

All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.

Say the fastest one takes the time 1. How much time will the others take?

And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?

I'm going to post full benchmarks + analysis on Wednesday.

-Peter

*) harddisk megabytes. The file is 150000034 bytes ≈ 143 mebibytes.

PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB?

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)

-- note that D.BS.Lazy.Char8.length is ByteString -> Int64
-- D.BS.C8.length is ByteString -> Int

cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)

main = do s <- B.getContents
          print (cnt s)

==============================
hs/space-bs-c8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

==============================
hs/space-bslc8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

==============================
hs/space-xxxxx-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s <- getContents
          print (cnt 0 s)

==============================
hs/space-bs-c8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)

==============================
hs/space-bslc8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)

==============================
hs/space-xxxxx-foldl.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)

==============================

On Sat, 2007-12-15 at 09:25 +0100, Peter Lund wrote:
What do you think the relative speeds are of the six small haskell programs at the end of this email?
Ok, I presume this is a guessing game and we're supposed to just look at the code without running and timing them.
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
Right'o. I'll have a go. Let's see if I can't embarrass myself with being completely inaccurate.
PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB?
Ok.
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)

-- note that D.BS.Lazy.Char8.length is ByteString -> Int64
-- D.BS.C8.length is ByteString -> Int
Yes, because a strict bytestring cannot be bigger than the size of virtual memory, and with GHC at least, Int tracks the size of the machine pointer.
cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)

main = do s <- B.getContents
          print (cnt s)
Hmm. So that should work in constant memory, a few 64k chunks at once. I'd expect this to be pretty fast.
==============================
hs/space-bs-c8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
This uses strict bytestrings so will use at least 150Mb and that'll make it a good deal slower. In fact it'll be worse than that since getContents does not know in advance how big the input will be so it has to play the doubling and copying game. So it'll end up copying all the data roughly twice. cnt is strict and tail recursive so that shouldn't be any problem, though it's probably not as fast as the first length . filter since head, tail, null all have to do bounds checks.
==============================
hs/space-bslc8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
For the same reason as above, I'd expect this cnt to be slower than B.length . B.filter (== ' ')
==============================
hs/space-xxxxx-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s <- getContents
          print (cnt 0 s)
Lazy, so constant memory use, but much higher constant factors due to using String.
==============================
hs/space-bs-c8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
This is of course still strict so that's going to make the reading slow. This is a manually fused B.length . B.filter (== ' ') which hopefully is the same speed as the automatically fused one if the fusion is working ok. If not, then the B.length . B.filter (== ' ') will be doing an extra copy, and memory writes are expensive.
==============================
hs/space-bslc8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
As above but now in constant memory space.
==============================
hs/space-xxxxx-foldl.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)
Oh, no! not foldl that's a killer.

Ok, so best way to summarise I think is to organise by data type since I think that'll dominate. So I think the lazy bytestring versions will be fastest due to having the best memory access patterns and doing the least copying. I think the foldl's will be faster than the explicit accumulators due to having fewer bounds checks.

  space-bslc8-foldlx-1
  space-bslc8-acc-1
  space-bs-c8-foldlx-1
  space-bs-c8-acc-1
  space-xxxxx-acc-1
  space-xxxxx-foldl

I'll try guessing at some ratios:

  1.0  space-bslc8-foldlx-1
  1.1  space-bslc8-acc-1
  2.0  space-bs-c8-foldlx-1
  2.1  space-bs-c8-acc-1
  4.0  space-xxxxx-acc-1
  15   space-xxxxx-foldl

Duncan

On Sat, 2007-12-15 at 14:34 +0000, Duncan Coutts wrote:
Ok, I presume this is a guessing game and we're supposed to just look at the code without running and timing them.
Precisely :)
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
Right'o. I'll have a go. Let's see if I can't embarrass myself with being completely inaccurate.
Thanks for biting! You were, thankfully, only almost completely inaccurate ;)
PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB?
Hmm. So that should work in constant memory, a few 64k chunks at once. I'd expect this to be pretty fast.
You are right about the speed. Can you guess a number in kilobytes?
==============================
hs/space-bs-c8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
This uses strict bytestrings so will use at least 150Mb and that'll make it a good deal slower. In fact it'll be worse than that since getContents does not know in advance how big the input will be so it has to play the doubling and copying game. So it'll end up copying all the data roughly twice. cnt is strict and tail recursive so that shouldn't be any problem, though it's probably not as fast as the first length . filter since head, tail, null all have to do bounds checks.
You are right about the memory. It is actually slightly faster than the "extra credit" (length/filter combination) above.
==============================
hs/space-bslc8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
For the same reason as above, I'd expect this cnt to be slower than B.length . B.filter (== ' ')
It is slower but not for the same reason as above.
==============================
hs/space-xxxxx-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s <- getContents
          print (cnt 0 s)
Lazy, so constant memory use, but much higher constant factors due to using String.
Spot on.
==============================
hs/space-bs-c8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
This is of course still strict so that's going to make the reading slow.
Nope.
This is a manually fused B.length . B.filter (== ' ') which hopefully is the same speed as the automatically fused one if the fusion is working ok. If not, then the B.length . B.filter (== ' ') will be doing an extra copy, and memory writes are expensive.
==============================
hs/space-bslc8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
As above but now in constant memory space.
Nope.
==============================
hs/space-xxxxx-foldl.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)
Oh, no! not foldl that's a killer.
You think it's worse than the program just above?
Ok, so best way to summarise I think is to organise by data type since I think that'll dominate.
So I think the lazy bytestring versions will be fastest due to having the best memory access patterns and doing the least copying. I think the foldl's will be faster than the explicit accumulators due to having fewer bounds checks.
  space-bslc8-foldlx-1
  space-bslc8-acc-1
  space-bs-c8-foldlx-1
  space-bs-c8-acc-1
  space-xxxxx-acc-1
  space-xxxxx-foldl

I'll try guessing at some ratios:

  1.0  space-bslc8-foldlx-1
  1.1  space-bslc8-acc-1
  2.0  space-bs-c8-foldlx-1
  2.1  space-bs-c8-acc-1
  4.0  space-xxxxx-acc-1
  15   space-xxxxx-foldl
I've done the measurements on a 2GHz Athlon64 3000+, a 1667MHz Core Duo, and a 600MHz Pentium III. They all show the same pattern (with a few minor aberrations for the PIII). You did get one of the relative speeds right ;) (and some of the memory usages).

I've tested with ghc 6.8.1, 6.9.20071119, and 6.9.20071208 (or thereabouts). 6.6.1 won't run my benchmarks, and it also won't let me install bytestring-0.9.0.1 to replace its built-in version.

-Peter

firefly:
What do you think the relative speeds are of the six small haskell programs at the end of this email?
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
How are you compiling these programs, by the way? ghc-6.8.2 -O2 ? (-O2 is required for bytestrings :) -- Don

On Sat, 2007-12-15 at 11:59 -0800, Don Stewart wrote:
firefly:
What do you think the relative speeds are of the six small haskell programs at the end of this email?
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
How are you compiling these programs, by the way? ghc-6.8.2 -O2 ? (-O2 is required for bytestrings :)
With -O2.

I have measured with 6.8.1, 6.9.20071119, 6.9.20071208 (approx), and 6.9.200712xx (as of yesterday or today). The picture changes very little with the compiler, if it changes at all. I have run them on three very different microarchitectures (2GHz Athlon64 3000+, 1667MHz Core Duo, 600MHz Pentium III).

All the measurements are scripted. 'make phase1' compiles the benchmarks, creates input files of various sizes, and runs each benchmark once to gather information about memory use (peak RSS + the ghc RTS' own information about allocations and gcs), page faults, and an strace. This phase is not timing sensitive so I can browse the web and listen to music etc. while running it.

'make phase2' runs each benchmark a number of times, calculates the average time for each + the relative size of the standard deviation + how much user+sys differs from real (as reported by bash's built-in time command). A report with barcharts indicating relative time and relative peak RSS is generated in either pure ASCII or in UTF-8 (with fractional-width block chars so the charts look nice and have high resolution). If the measurements are deemed to be bad (too high a standard deviation, or user+sys doesn't add up to real) then the barchart is done with '%' characters. The "quality indicators" for each timing test are always printed out next to each bar, so we know how close we are to being perfect or, conversely, how bad the measurements are.

There is a script that analyzes the I/O pattern and sums it up (as "4375 x read(3, 4096, ...) = 4096 followed by 1 x read(3, 4096, ...) = 1242 followed by 1 x read(3, 4096, ...) = 0" and similar). There is a set of simple I/O programs in C so we can compare ghc's performance with the "speed of light", and so different I/O strategies can be compared in a cleaner, purer form.

There are also 'make cache' (runs cachegrind), 'make hs/xxxx.doc' (creates a file with source code + core + stg + c-- + assembly + times for a given benchmark), etc. 'make sysinfo' creates a file with information about the Linux distribution used (/etc/lsb-release), kernel version (uname -a), CPU (/proc/cpuinfo), and compilers used (ghc --version, gcc --version). 'make zipdata' creates a zip file of about 20K with all the raw time measurements + the sysinfo.

I also have a set of scripts that installs each ghc version so 1) it is easy for me to repeat the tests on various machines and 2) you can see exactly which ghc versions I use and repeat the experiments yourself.

-Peter
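(For readers who want a concrete picture of what the 'phase2' timing loop does, here is a minimal sketch in Haskell. It is not Peter's actual harness - that is make- and shell-based - and the program name, the run count, and the omission of the stdin redirection the real benchmarks need are all assumptions made for the example.)

import Control.Monad (replicateM)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import System.Process (rawSystem)

-- Run a program n times and return (mean wall-clock seconds,
-- relative standard deviation), roughly what 'make phase2' reports.
timeRuns :: Int -> FilePath -> [String] -> IO (Double, Double)
timeRuns n prog args = do
    ts <- replicateM n once
    let mean = sum ts / fromIntegral n
        var  = sum [ (t - mean) ^ (2 :: Int) | t <- ts ] / fromIntegral n
    return (mean, sqrt var / mean)
  where
    once = do
        t0 <- getCurrentTime
        _  <- rawSystem prog args          -- exit code ignored in this sketch
        t1 <- getCurrentTime
        return (realToFrac (diffUTCTime t1 t0))

main :: IO ()
main = timeRuns 5 "./space-bslc8-foldlx-1" [] >>= print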

firefly:
What do you think the relative speeds are of the six small haskell programs at the end of this email?
All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function.
Say the fastest one takes the time 1. How much time will the others take?
And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)?
I'm going to post full benchmarks + analysis on Wednesday.
-Peter
*) hardddisk megabytes. The file is 150000034 bytes ≈ 143 mebibytes.
PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB?
Well, I'm not going to wait till Wednesday for the numbers!

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.8.2

$ du -hs 150M
150M    150M

------------------------------------------------------------------------
Program 1:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)

-- note that D.BS.Lazy.Char8.length is ByteString -> Int64
-- D.BS.C8.length is ByteString -> Int

cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)

main = do s <- B.getContents
          print (cnt s)

Ok, so lazy bytestrings. Should be constant space use, but two traversals over each lazy chunk (since there's no lazy bytestring fusion). Not perfect, but should be ok. length will consume chunks as they're produced by filter.

** Prediction:
    Constant ~3M space use (runtime, plus small overhead).
    Fast, due to chunk-wise processing.

** Result:

$ ghc -O2 A.hs -o A --make
$ time ./A < 150M +RTS -sstderr
./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total

And top says 40M allocated.

** Summary:
    Ok, pretty fast, but an unexpected(!) amount of memory allocated.

Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion' is messing things up. In Data.ByteString.Lazy, we have:

    filter p = F.loopArr . F.loopL (F.filterEFL p) F.NoAcc

We keep meaning to replace this stuff with the stream fusion mechanisms, which compile a lot better. Perhaps the time has come to look at that :)

I'll put this memory allocation down as a bug that needs to be looked at.

------------------------------------------------------------------------
Program 2:

hs/space-bs-c8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

Strict bytestrings, and you manually fuse the length/filter calculation. Allocating all that memory will cost you.

** Prediction:
    O(N) memory, around 150M allocated.
    Slower, due to cache effects (more data to traverse) and more indirections.

** Result:

top says 154M
./B +RTS -sstderr < 150M  1.10s user 0.52s system 111% cpu 1.454 total

** Summary:
    Seems reasonable, when it's doing all that allocation.

------------------------------------------------------------------------
Program 2a:

But we could easily improve this program. Since

    length . filter (== ' ')
 == length . filterByte ' '
 == count ' '

we have:

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt x = B.count ' ' x

main = do s <- B.getContents
          print (cnt s)

** Prediction:
    Which should be a bit faster.

** Result:

$ time ./B < 150M
24569024
./B < 150M  0.66s user 0.55s system 113% cpu 1.070 total

** Summary:
    So that's the fastest program so far.

The rewrite rules to do these transformations used to be enabled, but need looking at again. There should also be no real benefit to manually fusing a length . filter loop like this; however, the old fusion system used in bytestring might have some small overhead here. This also needs looking at.

------------------------------------------------------------------------
Program 2b:

We can do even better if we read the file in:

import qualified Data.ByteString.Char8 as B
import System.IO.Posix.MMap

main = print . B.count ' ' =<< mmapFile "150M"

** Prediction:
    super fast

** Result:

$ time ./B2
24569024
./B2  0.31s user 0.01s system 101% cpu 0.314 total

(similar results if you use vanilla B.readFile too, fwiw)

** Summary:
    This suggests to me we could look again at how strings of unknown size are read in.

------------------------------------------------------------------------
Program 3:

hs/space-bslc8-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

Ok, for lazy bytestrings.

** Prediction:
    Should run in constant space, but there are more checks than in the unfused length/filter case. It should be a bit slower, and hopefully run in constant space.

** Result:

$ time ./C < 150M
24569024
./C < 150M  2.36s user 0.11s system 99% cpu 2.489 total

top says 3804K

** Summary:
    So it runs in the 3M constant space I'd expected the original program to run in, but it's a fair bit slower. The generated code looks pretty good though. Investigate why this is slower.

------------------------------------------------------------------------
Program 4:

hs/space-xxxxx-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s <- getContents
          print (cnt 0 s)

** Prediction:
    Lazy, so constant space, but probably 10x slower than the previous program.

** I'm not going to bother with this one, since strings suck for large data.

------------------------------------------------------------------------
Program 5:

hs/space-bs-c8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)

Ok, a strict foldl'.

** Prediction:
    Should be similar to the strict bytestring first example, and a little faster due to no redundant null checks.

** Result:

$ time ./D < 150M
24569024
./D < 150M  1.02s user 0.58s system 111% cpu 1.436 total

** Summary:
    As expected.

------------------------------------------------------------------------
Program 6:

hs/space-bslc8-foldlx-1.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString -> Int
cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)

A strict fold over a lazy bytestring.

** Prediction:
    Ok, constant memory use, and should be similar in speed to the first length . filter program. It'll be faster than the manual accumulating lazy bytestring program, since the redundant null checks are gone from the inner loop.

** Result:

$ time ./E < 150M
24569024
./E < 150M  0.84s user 0.07s system 98% cpu 0.928 total

2468k memory.

Wow. Didn't expect it to be that fast. So properly lazy, and nicely fast. This is what we'd hope to see :) It also reinforces that there's a bug in the length . filter program.

------------------------------------------------------------------------
Program 7:

hs/space-xxxxx-foldl.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)

Hmm. Lazy accumulator eh, on String? Should exhibit a space leak. Not worth running.

**********************
Summary:

* Program 1 is fast, as expected, but exhibits a bug in the bytestring library's lazy bytestring fusion system. Something in length or filter isn't doing the right job. This code will be replaced by the stream fusion system soon.
* Program 2: as expected. strict IO uses O(N) space, and that has performance effects.
* Program 3: lazy bytestrings use constant space, but you better avoid redundant bounds checks in the inner loops.
* Program 4: strings are silly.
* Program 5: as expected. similar to program 2.
* Program 6: strict foldl's over lazy bytestrings are good :) fast, and constant space.
* Program 7: see program 4.

Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think).

Nice little benchmark.

-- Don

Am Sonntag, 16. Dezember 2007 04:07 schrieb Don Stewart:
------------------------------------------------------------------------
Program 7:

hs/space-xxxxx-foldl.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)
Hmm. Lazy accumulator eh, on String? Should exhibit a space leak.
Doesn't (with -O2, at least); it seems ghc's strictness analyser did a good job. It is indeed about 10x slower than ByteStrings, but very memory friendly - and, actually, on my machine it's faster (not by much) than Data.List.foldl'. And, again on my machine, Program 3 is almost as slow when compiled with 6.8.1 and twice as slow when compiled with 6.6.1.
Nice little benchmark.
-- Don
Cheers, Daniel

On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote:
Hmm. Lazy accumulator eh, on String? Should exhibit a space leak.
Doesn't (with -O2, at least); it seems ghc's strictness analyser did a good job. It is indeed about 10x slower than ByteStrings, but very memory friendly -
Daniel is right, there's no space leak. Try it. You'll get a nice surprise :) -Peter

firefly:
On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote:
Hmm. Lazy accumulator eh, on String? Should exhibit a space leak.
Doesn't (with -O2, at least); it seems ghc's strictness analyser did a good job. It is indeed about 10x slower than ByteStrings, but very memory friendly -
Daniel is right, there's no space leak.
Try it. You'll get a nice surprise :)
Very nice. If we disable the strictness analyser,

$ ghc -O2 -fno-strictness A.hs -no-recomp -o A

we get a core loop that looks like:

  lgo :: Int -> [Char] -> Int
  lgo n xs = case xs of
      []     -> n
      a : as -> lgo (case a of                        -- sad dons
                       C# c1_amH -> case c1_amH of
                           ' ' -> case n of I# i -> I# (i +# 1)
                           _   -> n
                    ) as

Look at that big lazy expression for 'n' not being forced!

And when run:

  1015M 1017M onproc/1  -  0:05 22.36% A

Scary stuff. Lots of Int thunks :)

But enabling the strictness analyser:

  lgo :: Int# -> [Char] -> Int#
  lgo (n :: Int#) (xs :: [Char]) = case xs of
      []     -> n
      a : as -> case a of
          C# c -> case c of
              ' ' -> lgo (n +# 1) as                  -- makes me happy
              _   -> lgo n as

And life is good again :)

What is quite amazing is how efficient this program is. I had to rerun this a dozen or so times, since I didn't quite believe it:

  $ time ./A < /usr/obj/data/150M
  24569024
  ./A < /usr/obj/data/150M  2.42s user 0.47s system 100% cpu 2.883 total

Pretty stunning, I think. Swapping in a slightly more eager structure, the lazy ByteString,

  $ time ./A < /usr/obj/data/150M
  24569024
  ./A < /usr/obj/data/150M  0.86s user 0.07s system 98% cpu 0.942 total

improves things by a good amount, but I think we can revisit the low level performance of lazy bytestrings again, in light of all the changes to the optimiser in the past 2 years.

-- Don
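(For comparison: the dependence on the strictness analyser disappears if the accumulator is forced at every step, which is essentially the Data.List.foldl' variant Daniel mentioned. A minimal sketch, not one of the benchmark programs:)

import Data.List (foldl')

-- Same counting loop as space-xxxxx-foldl, but foldl' forces the
-- accumulator at each step, so the loop stays strict even with
-- -fno-strictness.
cnt :: String -> Int
cnt = foldl' (\s c -> if c == ' ' then s + 1 else s) 0

main :: IO ()
main = getContents >>= print . cnt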

On Sat, 2007-12-15 at 21:18 -0800, Don Stewart wrote:
What is quite amazing is how efficient this program is.
Yep. I was very surprised too. Turns out there *was* a reason to run those string benchmarks, eh? ;)
improves things by a good amount, but I think we can revisit the low level performance of lazy bytestrings again, in light of all the changes to the optimiser in the past 2 years.
Yep. -Peter

On Sat, 2007-12-15 at 19:07 -0800, Don Stewart wrote:
Well, I'm not going to wait till Wednesday for the numbers!
But I got you to write down predictions before you ran the tests, which is precisely what I wanted :)
Summary,
* Program 1 is fast, as expected, but exhibits a bug in the bytestring library's lazy bytestring fusion system. Something in length or filter isn't doing the right job. This code will be replaced by the stream fusion system soon.
Good.
* Program 2: as expected. strict IO uses O(N) space, and that has performance effects.
* Program 3: lazy bytestrings use constant space, but you better avoid redundant bounds checks in the inner loops.
Maybe it's extra bounds-checking that makes it slow, as you say. It probably is. I must admit that I couldn't follow the core/stg/C--/assembly code at all.
* Program 4: strings are silly
No they are not. They are the default data structure for text and give the baseline that bytestrings should beat. I find it interesting to see *if* bytestrings beat it and if so, by how much. The vanilla string versions of my tests all use less memory than any of the other versions, but they are a bit slower. And perhaps not as much slower as they should be...
* Program 5: as expected. similar to program 2.
* Program 6: strict foldl's over lazy bytestrings are good :) fast, and constant space.
* Program 7: see program 4.
Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think).
Nice little benchmark.
Thanks :) There are more... -Peter

I've had a look at how some of the code was being compiled for strict and lazy bytestrings, and also which rules weren't firing. With some small tweaks the code seems back in good shape. An updated bytestring library is at:

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0....

Enjoy! :)

------------------------------------------------------------------------
Summary: the suspicious lazy bytestring program works now (constant space, and fastest overall, as expected originally).

Program 1, lazy bytestring length . filter

Yesterday:
    ./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total
    40M allocated

* Today (fixed!):
    ./A +RTS -sstderr < 150M  0.26s user 0.06s system 96% cpu 0.332 total
    2M allocated

Reason: deprecated array fusion mucking up the optimiser. I think we can close this regression.

------------------------------------------------------------------------
Also, I had a look at Program 3: lazy bytestring, custom loop.

Unchanged. 2.4s, constant space. This was a bit slow. Further investigation shows lots of unnecessary bounds checks, as we take apart the Chunk lazy bytestring type, then test and continue. This representation was chosen to make it possible to process chunks efficiently, so that we can avoid these bounds checks. Something like this instead:

    cnt :: Int -> B.ByteString -> Int
    cnt n B.Empty        = n
    cnt n (B.Chunk x xs) = cnt (n + cnt_strict 0 x) xs   -- process the lazy spine;
        -- now we can process a chunk without checking for Empty
      where
        cnt_strict !i !s                                  -- then the strict chunk
          | S.null s    = i
          | c == ' '    = cnt_strict (i+1) t
          | otherwise   = cnt_strict i t
          where
            (c,t) = (S.w2c (S.unsafeHead s), S.unsafeTail s)  -- no bounds check

    main = do s <- B.getContents; print (cnt 0 s)

This lets us avoid redundant checks for Empty, while allowing cnt_strict to avoid unnecessary checks for the empty strict bytestring. This is some 4x faster.

This alternating between lazy spines and strict chunk processing is the best way to get reliable performance from lazy bytestring custom loops.

-- Don
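(To compile the chunk-wise sketch above on its own, the internal modules have to be spelled out. The following is a self-contained rendering, assuming the module layout of bytestring 0.9 or later; the helper name chunkCnt is not Don's.)

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString          as S
import qualified Data.ByteString.Internal as S (w2c)
import qualified Data.ByteString.Unsafe   as S (unsafeHead, unsafeTail)
import qualified Data.ByteString.Lazy     as L
import           Data.ByteString.Lazy.Internal (ByteString (..))

-- Walk the lazy spine; count spaces in each strict chunk with no
-- per-byte Empty tests and no bounds checks.
cnt :: Int -> ByteString -> Int
cnt !n Empty        = n
cnt !n (Chunk x xs) = cnt (n + chunkCnt 0 x) xs
  where
    chunkCnt :: Int -> S.ByteString -> Int
    chunkCnt !i s
      | S.null s                      = i
      | S.w2c (S.unsafeHead s) == ' ' = chunkCnt (i + 1) (S.unsafeTail s)
      | otherwise                     = chunkCnt i (S.unsafeTail s)

main :: IO ()
main = do
    s <- L.getContents
    print (cnt 0 s)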

On Sun, 2007-12-16 at 15:21 -0800, Don Stewart wrote:
An updated bytestring library is at :
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0....
Enjoy! :)
Thanks! -Peter

Don, and others,

This thread triggered something I've had at the back of my mind for some time.

The traffic on Haskell Cafe suggests that there is a lot of interest in the performance of Haskell programs. However, at the moment we don't have any good *performance* regression tests for GHC. We have zillions of behavioural regression tests (this program should compile, this one should fail), but nothing much on performance. We have the nofib suite, but it's pretty static these days. Peter's set of benchmarks are great (if very specific to strings etc, but that's fine), and it'd be a pity if they now sink beneath the waves.

What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions. Kind of like the Shootout, only just for Haskell, and with many more programs.

Like Hackage, it should be easy to add a new program. It'd be good to measure run-time, but allocation count, peak memory use, code size, and compilation time are also good (and rather more stable) numbers to capture.

Does anyone feel like doing this? It'd be a great service. No need to know anything much about GHC.

Simon

* Simon Peyton-Jones wrote:
Does anyone feel like doing this? It'd be a great service. No need to know anything much about GHC.
I'd like to do that. For a lecture I have already generated performance tests for various sorting algorithms. It's designed around a function "performance :: Size -> IO RunsPerSecond". So with unsafePerformIO it is a good candidate for QuickCheck.
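(To make the idea concrete, here is a minimal sketch of a function with that shape. The sorting workload and the use of getCPUTime are arbitrary choices for the example, not Lutz's code; a real harness would average over many runs and look at wall-clock time as well.)

import Control.Exception (evaluate)
import Data.List (sort)
import System.CPUTime (getCPUTime)

type Size          = Int
type RunsPerSecond = Double

-- Time one run of a stand-in workload (sorting a reversed list) and
-- report how many such runs would fit into one second of CPU time.
performance :: Size -> IO RunsPerSecond
performance n = do
    start <- getCPUTime
    _     <- evaluate (last (sort [n, n - 1 .. 1]))
    end   <- getCPUTime
    let seconds = fromIntegral (end - start) / 1e12   -- getCPUTime counts picoseconds
    return (if seconds > 0 then 1 / seconds else 0)

main :: IO ()
main = mapM_ (\n -> performance n >>= print) [1000, 10000, 100000]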

Simon Peyton-Jones
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions.
Something along these lines already exists - the nobench suite.

    darcs get http://www.cse.unsw.edu.au/~dons/code/nobench

It originally compared ghc, ghci, hugs, nhc98, hbc, and jhc. (Currently the results at http://www.cse.unsw.edu.au/~dons/nobench.html compare only variations of ghc fusion rules.)

I have just been setting up my own local copy - initial results at http://www.cs.york.ac.uk/fp/nobench/powerpc/results.html - where I intend to compare ghc from each of the 6.4, 6.6 and 6.8 branches, against nhc98 and any other compilers I can get working. I have powerpc, intel, and possibly sparc machines available.
Like Hackage, it should be easy to add a new program.
Is submitting a patch against the darcs repo sufficiently easy? Should we move the master darcs repo to somewhere more accessible, like code.haskell.org?
It'd be good to measure run-time,
Done...
but allocation count, peak memory use, code size, compilation time are also good (and rather more stable) numbers to capture.
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use is?

Regards, Malcolm

Malcolm Wallace wrote:
Simon Peyton-Jones wrote:
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions.
Something along these lines already exists - the nobench suite. darcs get http://www.cse.unsw.edu.au/~dons/code/nobench It originally compared ghc, ghci, hugs, nhc98, hbc, and jhc. (Currently the results at http://www.cse.unsw.edu.au/~dons/nobench.html compare only variations of ghc fusion rules.)
I have just been setting up my own local copy - initial results at http://www.cs.york.ac.uk/fp/nobench/powerpc/results.html where I intend to compare ghc from each of the 6.4, 6.6 and 6.8 branches, against nhc98 and any other compilers I can get working. I have powerpc, intel, and possibly sparc machines available.
That's great. BTW, GHC has a performance bug affecting calendar at the moment: http://hackage.haskell.org/trac/ghc/ticket/1168 The best GHC options for this program might therefore be -O2 -fno-state-hack. Or perhaps just -O0.
Like Hackage, it should be easy to add a new program.
Is submitting a patch against the darcs repo sufficiently easy? Should we move the master darcs repo to somewhere more accessible, like code.haskell.org?
Yes, please do. When I have a chance I'd like to help out.
It'd be good to measure run-time,
Done...
but allocation count, peak memory use, code size, compilation time are also good (and rather more stable) numbers to capture.
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are?
With GHC you need to use "+RTS -s" and then slurp in the <prog>.stat file. You can also get allocations, peak memory use, and separate mutator/GC times this way. Cheers, Simon
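(A small sketch of the 'slurp in the stats file' step. The wording of the summary lines - "bytes allocated in the heap", "bytes maximum residency" - and the file name prog.stat are assumptions about the RTS output, so check them against a real .stat file before relying on this.)

import Data.Char (isDigit)
import Data.List (isInfixOf)

-- Find the first line containing the key and read its leading number,
-- dropping the thousands separators GHC prints (e.g. "103,389,440").
statNumber :: String -> String -> Maybe Integer
statNumber key stats =
    case [ l | l <- lines stats, key `isInfixOf` l ] of
        (l:_) -> case filter isDigit (takeWhile (/= 'b') l) of
                     "" -> Nothing
                     ds -> Just (read ds)
        []    -> Nothing

main :: IO ()
main = do
    stats <- readFile "prog.stat"
    print (statNumber "bytes allocated in the heap" stats)
    print (statNumber "bytes maximum residency"     stats)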

Simon Marlow wrote:
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are?
With GHC you need to use "+RTS -s" and then slurp in the <prog>.stat file. You can also get allocations, peak memory use, and separate mutator/GC times this way.
Oh, and one more thing. We have this program called nofib-analyse in GHC's source tree:

http://darcs.haskell.org/ghc/utils/nofib-analyse

which takes the output from a couple of nofib runs and generates nice tables, in ASCII or LaTeX (for including in papers, see e.g. our pointer-tagging paper from ICFP'07). The only reason we haven't switched to using nobench for GHC is the existence of this tool. Unfortunately it relies on specifics of the output generated by a nofib run, and uses a Perl script, etc. etc. The point is, it needs some non-trivial porting.

I'm pointing this out just in case you or anyone else feels enthusiastic enough to port it to nobench, and to hopefully head off any duplication of effort. Failing that, I'll probably get around to porting it myself at some point.

Cheers, Simon

simonmarhaskell:
Malcolm Wallace wrote:
Simon Peyton-Jones wrote:
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions.
Something along these lines already exists - the nobench suite. darcs get http://www.cse.unsw.edu.au/~dons/code/nobench It originally compared ghc, ghci, hugs, nhc98, hbc, and jhc. (Currently the results at http://www.cse.unsw.edu.au/~dons/nobench.html compare only variations of ghc fusion rules.)
I have just been setting up my own local copy - initial results at http://www.cs.york.ac.uk/fp/nobench/powerpc/results.html where I intend to compare ghc from each of the 6.4, 6.6 and 6.8 branches, against nhc98 and any other compilers I can get working. I have powerpc, intel, and possibly sparc machines available.
That's great. BTW, GHC has a performance bug affecting calendar at the moment:
http://hackage.haskell.org/trac/ghc/ticket/1168
The best GHC options for this program might therefore be -O2 -fno-state-hack. Or perhaps just -O0.
Like Hackage, it should be easy to add a new program.
Is submitting a patch against the darcs repo sufficiently easy? Should we move the master darcs repo to somewhere more accessible, like code.haskell.org?
Yes, please do. When I have a chance I'd like to help out.
Malcolm, can you just take the darcs repo, and move it to code.haskell.org? -- Don

Malcolm.Wallace:
Simon Peyton-Jones wrote:
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions.
Something along these lines already exists - the nobench suite. darcs get http://www.cse.unsw.edu.au/~dons/code/nobench It originally compared ghc, ghci, hugs, nhc98, hbc, and jhc. (Currently the results at http://www.cse.unsw.edu.au/~dons/nobench.html compare only variations of ghc fusion rules.)
I have just been setting up my own local copy - initial results at http://www.cs.york.ac.uk/fp/nobench/powerpc/results.html where I intend to compare ghc from each of the 6.4, 6.6 and 6.8 branches, against nhc98 and any other compilers I can get working. I have powerpc, intel, and possibly sparc machines available.
Like Hackage, it should be easy to add a new program.
Is submitting a patch against the darcs repo sufficiently easy? Should we move the master darcs repo to somewhere more accessible, like code.haskell.org?
It'd be good to measure run-time,
Done...
but allocation count, peak memory use, code size, compilation time are also good (and rather more stable) numbers to capture.
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are?
Yeah, this is hard. There are various non-portable perl scripts for this kind of thing, or +RTS -sstderr -- Don

On Thu, Dec 20, 2007 at 10:58:17AM +0000, Malcolm Wallace wrote:
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are?
This: http://lists.osuosl.org/pipermail/darcs-devel/2006-January/004016.html should be Haskell-implementation-independent, but is probably Linux-specific. Adapting it to other Unix-like OSes is probably easy, but I have no idea about Windows. Thanks Ian

On Sun, 2007-12-23 at 11:52 +0000, Ian Lynagh wrote:
On Thu, Dec 20, 2007 at 10:58:17AM +0000, Malcolm Wallace wrote:
Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are?
This: http://lists.osuosl.org/pipermail/darcs-devel/2006-January/004016.html should be Haskell-implementation-independent, but is probably Linux-specific. Adapting it to other Unix-like OSes is probably easy, but I have no idea about Windows.
Very nice. A short-term improvement would perhaps be to use ptrace() to also sample the program counter register? On a longer-term scale, I wonder how hard it would be to implement a valgrind skin to get much more precise heap-use information... -Peter

On Thu, 2007-12-20 at 10:37 +0000, Simon Peyton-Jones wrote:
Don, and others,
This thread triggered something I've had at the back of my mind for some time.
The traffic on Haskell Cafe suggests that there is a lot of interest in the performance of Haskell programs. However, at the moment we don't have any good *performance* regression tests for GHC. We have zillions of behavioural regression tests (this program should compile, this one should fail), but nothing much on performance. We have the nofib suite, but it's pretty static these days. Peter's set of benchmarks are great (if very specific to strings etc, but that's fine), and it'd be a pity if they now sink beneath the waves.
They won't! I have set up a mercurial repository on http://vax64.dyndns.org/repo/hg/ together with the ghc install scripts I've used. Once the basic string performance is under control, I intend to expand it with more advanced parsing, with I/O, and with backend stuff.

I like Parsec. But it seems to hang on to a bit more memory than it should, and I think it should be faster than it is.

Fast I/O is not simple, and to do it really well, one probably needs to use threading and mmap() in combination. mmap() alone is usually not very performant unless the file has already been cached by the operating system.

And the backend. Ouch. The frontend is absolutely fantastic and does heroic stuff -- but the backend... apart from having many phases, it doesn't do much ;)
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions. Kind of like the Shootout, only just for Haskell, and with many more programs.
I don't see why a lot of that couldn't be added to the framework I have. It's GPLv2 :)
Like Hackage, it should be easy to add a new program. It'd be good to measure run-time, but allocation count, peak memory use, code size,
My framework captures the allocation count but it doesn't use it for anything. It gets its peak memory info from /proc/self/status (which it captures, together with /proc/self/maps, through an LD_PRELOAD trick). '-sstderr' seemed a bit unreliable in my experience, so I fell back to asking the operating system. Making sure one gets stable times + a good estimate of the quality of the measurements is also important (which my code already does).
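(For concreteness, a small sketch of the 'ask the operating system' part, reading peak memory figures from /proc/self/status. This is Linux-only; the VmPeak/VmHWM field names are what current kernels print, and a real harness would of course read the benchmark process's status file via the LD_PRELOAD hook rather than its own.)

import Data.List (isPrefixOf)

-- Extract one "Key: value" field from the contents of a /proc status file.
statusField :: String -> String -> Maybe String
statusField key contents =
    case [ l | l <- lines contents, (key ++ ":") `isPrefixOf` l ] of
        (l:_) -> Just (unwords (drop 1 (words l)))   -- e.g. "152344 kB"
        []    -> Nothing

main :: IO ()
main = do
    status <- readFile "/proc/self/status"
    print (statusField "VmPeak" status)   -- peak virtual memory size
    print (statusField "VmHWM"  status)   -- peak resident set size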
compilation time are also good (and rather more stable) numbers to capture.
Does anyone feel like doing this? It'd be a great service. No need to know anything much about GHC.
I think I've made a start but this is clearly not something I'm willing to take on by myself. -Peter

simonpj:
Don, and others,
This thread triggered something I've had at the back of my mind for some time.
The traffic on Haskell Cafe suggests that there is a lot of interest in the performance of Haskell programs. However, at the moment we don't have any good *performance* regression tests for GHC. We have zillions of behavioural regression tests (this program should compile, this one should fail), but nothing much on performance. We have the nofib suite, but it's pretty static these days. Peter's set of benchmarks are great (if very specific to strings etc, but that's fine), and it'd be a pity if they now sink beneath the waves.
What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions. Kind of like the Shootout, only just for Haskell, and with many more programs.
Like Hackage, it should be easy to add a new program. It'd be good to measure run-time, but allocation count, peak memory use, code size, compilation time are also good (and rather more stable) numbers to capture.
Does anyone feel like doing this? It'd be a great service. No need to know anything much about GHC.
Ok, so I should revive nobench then, I suspect.

http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html

that kind of thing? I'll see how far I can get updating the graph for the current suite of compilers.

-- Don

On Thursday 20 December 2007 19:02, Don Stewart wrote:
Ok, so I should revive nobench then, I suspect.
http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html
that kind of thing?
Many of those benchmarks look good.

However, I suggest avoiding trivially reducible problems like computing constants (e, pi, primes, fib) and redundant operations (binary trees). Make sure programs accept a non-trivial input (even if it is just an int over a wide range). Avoid unnecessary repeats (e.g. atom.hs). This will mean that transformations that improve performance on the benchmark suite will be more likely to improve the performance of real programs.

I would recommend adding:

1. FFT.
2. Graph traversal, e.g. "n"th-nearest neighbor.

These should be <100LOC each.

-- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e
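(To give a sense of scale for the "<100LOC" figure: a textbook radix-2 Cooley-Tukey FFT over lists fits comfortably. The sketch below is illustrative only - it is not nobench code and is not tuned for performance - and it assumes the input length is a power of two.)

import Data.Complex

-- Radix-2 decimation-in-time FFT; the input length must be a power of two.
fft :: [Complex Double] -> [Complex Double]
fft [x] = [x]
fft xs  = zipWith (+) evens' twiddled ++ zipWith (-) evens' twiddled
  where
    n             = length xs
    (evens, odds) = deinterleave xs
    evens'        = fft evens
    twiddled      = zipWith (*)
                        [ cis (-2 * pi * fromIntegral k / fromIntegral n) | k <- [0 ..] ]
                        (fft odds)

-- Split a list into its even- and odd-indexed elements.
deinterleave :: [a] -> ([a], [a])
deinterleave (x:y:rest) = let (es, os) = deinterleave rest in (x : es, y : os)
deinterleave xs         = (xs, [])

main :: IO ()
main = print (sum (map magnitude (fft (map (:+ 0) [1 .. 1024]))))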

jon:
On Thursday 20 December 2007 19:02, Don Stewart wrote:
Ok, so I should revive nobench then, I suspect.
http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html
that kind of thing?
Many of those benchmarks look good.
However, I suggest avoiding trivially reducible problems like computing constants (e, pi, primes, fib) and redundant operations (binary trees). Make sure programs accept a non-trivial input (even if it is just an int over a wide range). Avoid unnecessary repeats (e.g. atom.hs). This will mean that transformations that improve performance on the benchmark suite will be more likely to improve the performance of real programs.
This is a long recognised issue. The benchmark suite is a variant of the nofib suite, described here: http://citeseer.ist.psu.edu/partain93nofib.html which breaks the programs up into imaginary, spectral and real categories of programs.
I would recommend adding:
1. FFT.
2. Graph traversal, e.g. "n"th-nearest neighbor.
These should be <100LOC each.
Sounds good. Patches can be sent via darcs. -- Don

However, I suggest avoiding trivially reducible problems like computing constants (e, pi, primes, fib) and redundant operations (binary trees). Make sure programs accept a non-trivial input (even if it is just an int over a wide range). Avoid unnecessary repeats (e.g. atom.hs). This will mean that transformations that improve performance on the benchmark suite will be more likely to improve the performance of real programs.
May I suggest using pureMD5 as one of the benchmarks? It makes sense in my mind that a tool with a real use and consistent run times be used. Not sure if I would consider the rolled (concise) or unrolled (ugly) version a better fit (or both). The rolled version showed an excellent speed increase due to pointer tagging (17sec down from 27sec for hashing 200MB). On the other hand, the unrolled version is the obvious 'make as ugly a Haskell program as you can to compete with C'.

On Thu, 20 Dec 2007, Jon Harrop wrote:
On Thursday 20 December 2007 19:02, Don Stewart wrote:
Ok, so I should revive nobench then, I suspect.
http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html
that kind of thing?
Many of those benchmarks look good.
However, I suggest avoiding trivially reducible problems like computing constants (e, pi, primes, fib) and redundant operations (binary trees). Make sure programs accept a non-trivial input (even if it is just an int over a wide range). Avoid unnecessary repeats (e.g. atom.hs). This will mean that transformations that improve performance on the benchmark suite will be more likely to improve the performance of real programs.
I would recommend adding:
1. FFT.
http://hackage.haskell.org/packages/archive/dsp/0.2/doc/html/Numeric-Transfo...
2. Graph traversal, e.g. "n"th-nearest neighbor.

Jon Harrop wrote:
On Thursday 20 December 2007 19:02, Don Stewart wrote:
Ok, so I should revive nobench then, I suspect.
http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html
that kind of thing?
Many of those benchmarks look good.
However, I suggest avoiding trivially reducible problems like computing constants (e, pi, primes, fib)
true in general, but I know I use computed constants in real code because it's cleaner than using their expanded value, so it's worth checking whether a compiler can (still) do (how much of) that. Isaac

ajb@spamcop.net wrote:
G'day all.
Quoting Jon Harrop:
I would recommend adding:
1. FFT.
2. Graph traversal, e.g. "n"th-nearest neighbor.
I'd like to put in a request for Pseudoknot. Does anyone still have it?
This is it, I think: http://darcs.haskell.org/nofib/spectral/hartel/nucleic2 Cheers, Simon

Don Stewart wrote:
cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)
[...]
Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion' is messing things up. In Data.ByteString.Lazy, we have:
Are you sure you have a fusible length? I think I only added it to NDP after stream fusion went in. Roman

rl:
Don Stewart wrote:
cnt :: B.ByteString -> Int64
cnt bs = B.length (B.filter (== ' ') bs)
[...]
Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion' is messing things up. In Data.ByteString.Lazy, we have:
Are you sure you have a fusible length? I think I only added it to NDP after stream fusion went in.
It was the array fusion from prior to the stream stuff (foldl . filter), which was in fact messing up the simplifier. I've fixed this (turned off array fusion for now), and things are back to normal (well, much faster, actually). -- Don
participants (15):
- ajb@spamcop.net
- Daniel Fischer
- Don Stewart
- Duncan Coutts
- Henning Thielemann
- Ian Lynagh
- Isaac Dupree
- Jon Harrop
- Lutz Donnerhacke
- Malcolm Wallace
- Peter Lund
- Roman Leshchinskiy
- Simon Marlow
- Simon Peyton-Jones
- Thomas DuBuisson