help wrt semantics / primops for pure prefetches

Hey everyone, https://ghc.haskell.org/trac/ghc/ticket/9353 and https://phabricator.haskell.org/D350 lay out and prototype some preliminary work to fix how the pure versions of the prefetch primops behave. However, while it nominally fixes some of the ways the current pure prefetch APIs are fundamentally broken, the simple design in D350 isn't quite ideal, and I sketch out some other ideas in the associated ticket #9353. I'd like to make sure pure prefetch in 7.10 is slightly less broken than in 7.8, but either way, it's pretty clear that working out the right fixed-up design won't happen until 7.12. That is, whatever makes 7.10, there WILL have to be breaking changes to fix those primops for 7.12.

Thanks, and any feedback / thoughts appreciated,
-Carter

I've gotten some feedback and suggestions on IRC from Edward K.; I'm currently massaging the new (useful!) form through validate. Feedback on https://phabricator.haskell.org/D350 would be appreciated!

On Sat, Nov 22, 2014 at 12:43 AM, Carter Schonwald <carter.schonwald@gmail.com> wrote:

I haven't been watching this, but I have one question: does prefetching actually *work*? Do you have benchmarks (or better still, actual library/application code) that show some improvement? I admit to being slightly sceptical: when I've tried using prefetching in the GC it has always been a struggle to get something that shows an improvement, and even when I get things tuned on one machine it typically makes things slower on a different processor. And that's in the GC; doing it at the Haskell level should be even harder.

Cheers, Simon

On 22/11/2014 05:43, Carter Schonwald wrote:

My general experience with prefetching is that it is almost never a win when done just on trees, as in the usual mark-sweep or copy-collection garbage collector walk. Why? Because the delay between the time you prefetch and the time you use the data is too variable. Stack disciplines and prefetch don't mix nicely.

If you want to see a win out of it, you have to free up some of the ordering of your walk and tweak your whole application to support it. E.g. if you want to use prefetching in garbage collection, the way to do it is to switch from a strict stack discipline to using a small fixed-size queue on the output of the stack, then feed prefetch on the way into the queue rather than as you walk the stack. That paid out for me as a 10-15% speedup the last time I used it, after factoring in the overhead of the extra queue. Not too bad for a weekend project. =)

Without that sort of known lead-in time, it works out that prefetching is usually a net loss or vanishes into the noise.
As for the array ops, davean has a couple of cases with those for which the prefetching operations are a 20-25% speedup, which is what motivated Carter to start playing around with these again. I don't know offhand how easily those can be turned into public test cases, though.
-Edward
On Thu, Nov 27, 2014 at 4:36 AM, Simon Marlow

Actually, he was already working on them. I just joined him because I've wanted to have them for a while too.

I don't have a real example using prefetch (we just got it working!), but I do have the test case we used to make sure they were implemented correctly.
The code is http://lpaste.net/4551930812748005376 (I recommend using LLVM when testing this, but it shows the improvement either way), and here are two separate runs of 5 times each:
http://lpaste.net/5476270150657245184
http://lpaste.net/1229927210207412224
The first prefetches the location we're just about to look at; the second prefetches 10 lookups ahead. It's completely untuned, and the GC seems to throw a bunch of garbage into the nursery, but the prefetch version is notably faster.
We were just using this to make sure the implementation worked, so we didn't put too much effort into it. Originally we were working with binary search and speculative look-ahead, but that was far harder to maintain as the prefetch ops changed. (A bundled binary search is a lot easier, of course.) Sadly the early binary search results are of no use, because in their early form using them created massive nursery garbage pressure in pure code. We had one (very briefly) in IO, I believe, but I lack any information on it.
Yes, using prefetch well is hard. Yes, the optimal use of prefetch depends on the exact CPU and memory you have.

The best way to deal with this is of course to plan for it, and that can be warranted for some code. Code that does a lot of random access but doesn't use too much memory bandwidth can be fairly tolerant of prefetch distance, though. For example, the demo should drop to near-optimal quickly, and its performance should take quite a while to start dropping again. I believe on my system the "nearly optimal" range was from around 5 until around 40.
One nice version of a prefetch that I talked to Carter about was "stick the prefetch approximately the right number of instructions ahead of here"; I expect that is too complicated to implement, though.

As we don't have thunk prefetchers currently (sad), I can't give you a demo showing a speedup in the containers package, but I could write the bundled binary search demo for you if you like.
On Thu, Nov 27, 2014 at 5:20 AM, Edward Kmett

Thanks for this. In the copying GC I was using prefetching during the scan phase, where you do have a pretty good tunable knob for how far ahead you want to prefetch. The only variable is the size of the objects being copied, but most tend to be in the 2-4 word range. I did manage to get 10-15% speedups with optimal tuning, but it was a slowdown on a different machine or with the wrong tuning, which is why GHC doesn't have any of this right now.

Glad to hear this can actually be used to get real speedups in Haskell; I will be less sceptical from now on :)

Cheers, Simon

On 27/11/2014 10:20, Edward Kmett wrote:

The main takeaway I had from my work with prefetching was that if you can shove things into a fixed-size queue and prefetch on the way into the queue, instead of doing it just to sort of kickstart the next element during a tree traversal that is going to be demanded too fast to take full advantage of the latency, then you can smooth out a lot of the cross-system variance.

It is just incredibly invasive. =(

Re: doing prefetching in the mark phase, I just skimmed and found http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.9090&rep=rep1&type=pdf which appears to take a similar approach.

-Edward
On Fri, Nov 28, 2014 at 3:42 AM, Simon Marlow

On Thu, Nov 27, 2014 at 1:36 AM, Simon Marlow
I've gotten some speedups (for some operations) from speculative prefetch for hash tables -- e.g. for cuckoo hashing, where you know the key can be in one of two buckets with p=0.5. Prefetching one while you search the other lets you squeeze out some instruction-level parallelism, at the expense of cache thrashing.

The cache thrashing issue means that whether prefetching works for you depends a lot on your inputs: it can help if your program can handle some additional cache pressure, and it might hurt you otherwise.
G
--
Gregory Collins

Indeed, it's a *tool* that cuts both ways, and needs to pay for itself with benchmarks!

Part of why I'm pushing for the has_side_effects=True attribute in https://phabricator.haskell.org/D350 is that it'll help make it more predictable exactly when the instruction will fire, because using it effectively requires a pretty precise understanding of application-level memory pressure!

@greg, I'd love more code review if you don't mind :)
On Sat, Nov 29, 2014 at 9:15 PM, Gregory Collins

On Sat, Nov 29, 2014 at 6:33 PM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
@greg, I'd love more code review if you don't mind :)
I'm not familiar enough with the relevant code to have anything to say about your patch one way or the other, I'm afraid :(. LGTM, in principle.
G
--
Gregory Collins
participants (5)
- Carter Schonwald
- davean
- Edward Kmett
- Gregory Collins
- Simon Marlow