Using loop fusion for writing efficient high-level code Re[2]: [Haskell] [Fwd: Re: Computer Language Shootout]

Hello Duncan, Tuesday, February 27, 2007, 1:09:53 PM, you wrote: can you please provide examples of such code? The ultimate goal is to have a tutorial on writing efficient code in a high-level manner, but that is too ambitious. Just some examples with a little explanation of how this is compiled? Thanks in advance
So I think we should do the same. It even shows in the Shootout - the programs that are simultaneously fastest and clearest are not pure Haskell, but delegate their innermost loops to tuned C libraries (FPS and GMP).
I should note that FPS is almost completely Haskell code, not C. We use C for things like memcmp, memcpy, memset, reverse_copy, intersperse, maximum, minimum and count.
Certainly some of the innards are low-level-style Haskell, though not the kind that could be replicated in C, because we use high-level transformations to fuse loop bodies together and wrap them in high-performance, low-level loop code.
This is not the style where we just wrap well-tuned C code; it is a style where we generate high-performance, low-level code from a high-level spec. This relies on GHC's excellent and programmable optimiser.
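(As an aside for readers: "programmable optimiser" here refers to GHC's RULES pragmas, which let a library install its own rewrite rules into the compiler. The toy rule below only illustrates the mechanism; it is not one of the actual ByteString fusion rules.)

    module FuseDemo where

    -- A toy rewrite rule: collapse two adjacent traversals into one.
    -- The real ByteString/stream fusion rules rewrite combinators into
    -- stream form and eliminate intermediate buffers; this only shows
    -- the RULES mechanism itself.
    {-# RULES
    "myMap/myMap fuse" forall f g xs. myMap f (myMap g xs) = myMap (f . g) xs
      #-}

    myMap :: (a -> b) -> [a] -> [b]
    myMap _ []     = []
    myMap f (x:xs) = f x : myMap f xs
    {-# NOINLINE myMap #-}  -- keep calls to myMap visible so the rule can fire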
It's wrong to say that the Shootout improvements were only down to improved libraries. The performance of ByteString code improved very significantly between GHC 6.4 and 6.6, and a large part of that was down to optimiser improvements (not just the ForeignPtr representation change).
Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
can you please provide examples of such code?
I'd recommend taking a look at the new binary package. It's very cleanly written, and mostly easy to understand. It's also easy to see where the optimisations have been made. The only part that might induce sudden cranial expansion is the Builder monoid, which constructs a big delayed computation to efficiently fill out a lazy ByteString when the Put monad is run.

The optimisations haven't perverted the readability of the code much, but it's still quite fast. I've clocked end-to-end serialisation and deserialisation over an Infiniband network at 234 MB/sec (~25% of line rate). This consumed about 90% of one CPU on the sending side, while the receiving side was 100% pegged.
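(For readers who haven't seen the package: below is a minimal sketch of the Put/Get style that binary uses. The Packet type is made up purely for illustration; Put itself is layered over the Builder monoid Bryan mentions.)

    import Data.Binary     (Binary (..), encode, decode)
    import Data.Binary.Put (putWord8, putWord32be)
    import Data.Binary.Get (getWord8, getWord32be)
    import Data.Word       (Word8, Word32)

    -- A made-up wire record, just to show the style.
    data Packet = Packet { tag :: Word8, len :: Word32 }

    instance Binary Packet where
      put (Packet t l) = putWord8 t >> putWord32be l   -- runs in the Put monad
      get = do t <- getWord8                           -- runs in the Get monad
               l <- getWord32be
               return (Packet t l)

    -- encode produces a lazy ByteString; decode parses one back.
    roundTrip :: Packet -> Packet
    roundTrip = decode . encode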

On Wed, 2007-02-28 at 09:51 -0800, Bryan O'Sullivan wrote:
Bulat Ziganshin wrote:
can you please provide examples of such code?
I'd recommend taking a look at the new binary package. It's very cleanly written, and mostly easy to understand. It's also easy to see where the optimisations have been made. The only part that might induce sudden cranial expansion is the Builder monoid, which constructs a big delayed computation to efficiently fill out a lazy ByteString when the Put monad is run.
The optimisations haven't perverted the readability of the code much, but it's still quite fast. I've clocked end-to-end serialisation and deserialisation over an Infiniband network at 234 MB/sec (~25% of line rate). This consumed about 90% of one CPU on the sending side, while the receiving side was 100% pegged.
Yes, we've optimised the writing side much more than the reading side at the moment. I'm sure it's possible to get the reading up to at least the same speed. I think we ought to be able to push both further after that, because we're still not doing the bounds-check commoning-up transformation that we'd planned on (more GHC rules). So given enough hacking time there's more performance available.

In our benchmarks reading is currently about 40% of writing speed, and we're topping out at about 30-40% of maximum memory bandwidth for writes. Apparently that's only 200x faster than the faster of two common Python serialisation libs, so we've got some way to go yet.

Duncan

On Wed, 2007-02-28 at 16:38 +0300, Bulat Ziganshin wrote:
Hello Duncan,
Tuesday, February 27, 2007, 1:09:53 PM, you wrote:
can you please provide examples of such code? The ultimate goal is to have a tutorial on writing efficient code in a high-level manner, but that is too ambitious. Just some examples with a little explanation of how this is compiled?
The main example of course is ByteString fusion, as presented in our recent paper: http://www.cse.unsw.edu.au/~dons/papers/CSL06.html

The first example in the paper is the simple pipeline:

    return . foldl hash 5381 . map toLower . filter isAlpha =<< readFile f
      where hash h c = h * 33 + ord c

This is fairly fast when just using the ordinary ByteString implementations of foldl, map, filter, etc., but the naive C version is still faster. That is, just composing highly tuned low-level implementations of list-like combinators is not enough. We do not want the Python approach of stitching together fast C/C++ code with slow glue code. Instead we use the above more like a spec and try to generate high-performance low-level code.

We transform each of the functions foldl, map and filter into their stream-style counterparts, fuse them together and inline everything (the compiler also does some other critical optimisations like case-of-case) to get a single imperative loop. The result is competitive with C; indeed we have to rewrite the C code into an even uglier form to be able to beat the Haskell version.

Duncan
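(Roughly the same pipeline written against the strict Data.ByteString.Char8 API, as a sketch rather than the paper's exact benchmark code:)

    import qualified Data.ByteString.Char8 as B
    import Data.Char (toLower, isAlpha, ord)

    -- foldl/map/filter here are the ByteString combinators; with fusion
    -- enabled they compile down to a single loop over the buffer instead
    -- of three separate passes producing intermediate ByteStrings.
    hashFile :: FilePath -> IO Int
    hashFile f = do
      s <- B.readFile f
      return (B.foldl hash 5381 (B.map toLower (B.filter isAlpha s)))
      where
        hash h c = h * 33 + ord c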

The main example of course is ByteString fusion as presented in our recent paper: http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
btw, why did you restrict yourself to improving [Char], rather than [a]? naively, it would seem to me that most of the framework should work just as well for the general case, with some additional improvements through specialising a to Char. and if that is the case, it hurts to think that there is a nice framework out there that i can't use unless my a is Char.

claus

claus.reinke:
The main example of course is ByteString fusion as presented in our recent paper: http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
btw, why did you restrict yourself to improving [Char], rather than [a]?
naively, it would seem to me that most of the framework should work just as well for the general case, with some additional improvements through specialising a to Char. and if that is the case, it hurts to think that there is a nice framework out there that i can't use unless my a is Char.
claus
Fear not! Arbitrary arrays of 'a' are in the works, thanks to Roman's data parallel Haskell work. There's also already Storable a => Vector a, written for last year's SoC. Just needs some polishing. We should get around to that soon, I think. -- Don

On Thu, 2007-03-01 at 00:22 +0000, Claus Reinke wrote:
The main example of course is ByteString fusion as presented in our recent paper: http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
btw, why did you restrict yourself to improving [Char], rather than [a]?
We're not finished! :-) It's just the way our investigation took us. We started with an important special case. It's clear now that we can generalise. Note that when we started the ByteString work it was using quite a bit of C code and C-like Haskell code. It looked at first like most of the advantages were from the specialised representation.
naively, it would seem to me that most of the framework should work just as well for the general case, with some additional improvements through specialising a to Char.
Yes, as we say in the paper, we think that the fusion framework should work for any kind of sequence data structure.
and if that is the case, it hurts to think that there is a nice framework out there that i can't use unless my a is Char.
We're now looking at stream fusion for lists in general, and as Don says, there's also the NDP work, which is looking at arrays of arbitrary element type and with complex structure.

Duncan
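(For readers who haven't seen the stream representation underlying this work: a minimal sketch follows. Names and details are simplified; the real implementation is more elaborate and relies on RULES to eliminate the stream/unstream conversions.)

    {-# LANGUAGE ExistentialQuantification #-}

    -- One step of a producer: finished, an element plus a new state, or
    -- a state change with no element (Skip keeps the step functions
    -- non-recursive, which is what lets everything inline and fuse).
    data Step s a = Done
                  | Yield a s
                  | Skip s

    -- A stream is a step function together with an initial state.
    data Stream a = forall s. Stream (s -> Step s a) s

    stream :: [a] -> Stream a
    stream xs0 = Stream next xs0
      where next []     = Done
            next (x:xs) = Yield x xs

    unstream :: Stream a -> [a]
    unstream (Stream next s0) = go s0
      where go s = case next s of
                     Done       -> []
                     Skip s'    -> go s'
                     Yield x s' -> x : go s'

    -- Combinators are written on streams, so composing them composes
    -- step functions instead of building intermediate lists:
    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream next s0) = Stream next' s0
      where next' s = case next s of
                        Done       -> Done
                        Skip s'    -> Skip s'
                        Yield x s' -> Yield (f x) s'

    -- A fusible map on lists; a rule rewriting
    --   stream (unstream s)  ==>  s
    -- makes chains like mapL f . mapL g collapse into one loop.
    mapL :: (a -> b) -> [a] -> [b]
    mapL f = unstream . mapS f . stream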

Hello Duncan, Thursday, March 1, 2007, 1:21:49 PM, you wrote:
We're now looking at stream fusion for lists in general and as Don says, there's also the NDP work which is looking at arrays of arbitrary element type and with complex structure.
In short, this may become the most important piece of FP research of the 2000s, just as monads for imperative programming became the most influential paper of the '90s. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (5)
- Bryan O'Sullivan
- Bulat Ziganshin
- Claus Reinke
- dons@cse.unsw.edu.au
- Duncan Coutts