The main takeaway from my work with prefetching was this: if you can push things into a fixed-size queue and prefetch on the way into the queue, you can smooth out a lot of the cross-system variance. Prefetching just to kickstart the next element during a tree traversal doesn't work nearly as well, because that element is demanded too soon to take full advantage of the latency.

It is just incredibly invasive. =(

Re: doing prefetching in the mark phase, I just skimmed and found http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.9090&rep=rep1&type=pdf which appears to take a similar approach.

-Edward

On Fri, Nov 28, 2014 at 3:42 AM, Simon Marlow <marlowsd@gmail.com> wrote:
Thanks for this.  In the copying GC I was using prefetching during the scan phase, where you do have a pretty good tunable knob for how far ahead you want to prefetch.  The only variable is the size of the objects being copied, but most tend to be in the 2-4 words range.  I did manage to get 10-15% speedups with optimal tuning, but it was a slowdown on a different machine or with wrong tuning, which is why GHC doesn't have any of this right now.
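
To make the "how far ahead" knob concrete, here is a minimal C sketch of prefetching a fixed distance ahead during a linear scan. It is not GHC's GC code: `sum_fields`, `PREFETCH_DISTANCE`, and the array-of-pointers layout are assumptions made purely for illustration. `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning knob: how many elements ahead of the one being
 * processed we issue the prefetch. The right value depends on the
 * machine and on how much work is done per element. */
#define PREFETCH_DISTANCE 4

/* Hypothetical scan: walk an array of pointers in order, reading the
 * word each one points to. While handling element i, request the
 * target of element i + PREFETCH_DISTANCE so it has time to arrive. */
long sum_fields(long *const *refs, size_t n)
{
    assert(refs != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels) */
            __builtin_prefetch(refs[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += *refs[i];
    }
    return total;
}
```

Because the scan order is known in advance, the lead time is roughly `PREFETCH_DISTANCE` iterations; variable object sizes change how much work each iteration does, which is why the best distance can differ between workloads and machines.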

Glad to hear this can actually be used to get real speedups in Haskell, I will be less sceptical from now on :)

Cheers,
Simon

On 27/11/2014 10:20, Edward Kmett wrote:
My general experience with prefetching is that it is almost never a win
when done just on trees, as in the usual mark-sweep or copy-collection
garbage collector walk. Why? Because the interval between issuing the
prefetch and using the data is too variable. Stack disciplines and
prefetch don't mix nicely.

If you want to see a win out of it you have to free up some of the
ordering of your walk, and tweak your whole application to support it.
e.g. if you want to use prefetching in garbage collection, the way to do
it is to switch from a strict stack discipline to using a small
fixed-sized queue on the output of the stack, then feed prefetch on the
way into the queue rather than as you walk the stack. That paid off for
me as a 10-15% speedup last time I used it, even after factoring in the
overhead of the extra queue. Not too bad for a weekend project. =)
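
The stack-plus-queue arrangement described above might look roughly like the following C sketch. It is an illustration under stated assumptions, not the code from the project mentioned: the `Obj` layout, `QUEUE_SIZE`, and `mark_with_prefetch` are hypothetical names, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical heap object: a mark flag and up to two children. */
typedef struct Obj {
    struct Obj *child[2];
    int marked;
} Obj;

/* Assumed queue length; this is the prefetch lead and needs tuning. */
#define QUEUE_SIZE 8

/* Marks everything reachable from root and returns how many objects
 * were newly marked. max_objects bounds the number of heap objects. */
size_t mark_with_prefetch(Obj *root, size_t max_objects)
{
    /* Each object is scanned once and pushes at most two children,
     * so 2 * max_objects + 1 stack slots are always enough. */
    size_t cap = 2 * max_objects + 1;
    Obj **stack = malloc(cap * sizeof *stack);
    assert(stack != NULL);
    size_t sp = 0;

    Obj *queue[QUEUE_SIZE];           /* small FIFO ring buffer */
    size_t head = 0, count = 0;
    size_t marked = 0;

    if (root != NULL)
        stack[sp++] = root;

    while (sp > 0 || count > 0) {
        /* Move pointers from the stack into the queue, prefetching
         * each object on the way in. The object itself is not touched
         * here, only its address. */
        while (count < QUEUE_SIZE && sp > 0) {
            Obj *o = stack[--sp];
            __builtin_prefetch(o, 1, 3);   /* rw = 1: we will write the mark */
            queue[(head + count) % QUEUE_SIZE] = o;
            count++;
        }

        /* Take the oldest entry: its prefetch was issued up to
         * QUEUE_SIZE - 1 dequeues ago, so it has had time to arrive. */
        Obj *o = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;

        if (o->marked)
            continue;
        o->marked = 1;
        marked++;

        for (int i = 0; i < 2; i++) {
            if (o->child[i] != NULL) {
                assert(sp < cap);
                stack[sp++] = o->child[i];
            }
        }
    }

    free(stack);
    return marked;
}
```

The traversal order is no longer strictly depth-first: an object is only examined after it has spent time in the queue, which is the freedom in ordering that the approach relies on, and also the reason it tends to be invasive.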

Without that sort of known lead-in time, it works out that prefetching
is usually a net loss or vanishes into the noise.

As for the array ops, davean has a couple of cases w/ those for which
the prefetching operations are a 20-25% speedup, which is what motivated
Carter to start playing around with these again. I don't know off hand
how easily those can be turned into public test cases though.

-Edward

On Thu, Nov 27, 2014 at 4:36 AM, Simon Marlow <marlowsd@gmail.com> wrote:

    I haven't been watching this, but I have one question: does
    prefetching actually *work*?  Do you have benchmarks (or better
    still, actual library/application code) that show some improvement?
    I admit to being slightly sceptical - when I've tried using
    prefetching in the GC it has always been a struggle to get something
    that shows an improvement, and even when I get things tuned on one
    machine it typically makes things slower on a different processor.
    And that's in the GC; doing it at the Haskell level should be even
    harder.

    Cheers,
    Simon


    On 22/11/2014 05:43, Carter Schonwald wrote:

        Hey Everyone,
        in https://ghc.haskell.org/trac/ghc/ticket/9353
        and https://phabricator.haskell.org/D350
        there is some preliminary work, laid out and prototyped, to fix up
        how the pure versions of the prefetch primops work.

        However, while it nominally fixes some of the problems that make
        the current pure prefetch APIs fundamentally broken, the simple
        design in D350 isn't quite ideal, and I sketch out some other ideas
        in the associated ticket #9353.

        I'd like to make sure pure prefetch in 7.10 is slightly less broken
        than in 7.8, but either way, it's pretty clear that working out the
        right fixed-up design won't happen until 7.12. I.e., whatever makes
        it into 7.10, there WILL have to be breaking changes to fix those
        primops for 7.12.

        thanks and any feedback / thoughts appreciated
        -Carter


        _________________________________________________
        ghc-devs mailing list
        ghc-devs@haskell.org
        http://www.haskell.org/mailman/listinfo/ghc-devs
