Actually, he was already working on them. I just joined him because I've wanted to have them for a while too.
I don't have a real example using prefetch (we just got it working!), but I do have the test case we used to make sure the primops were implemented correctly.
The first prefetches the location we're just about to look at; the second prefetches 10 lookups ahead. It's completely untuned, and the GC seems to throw a bunch of garbage into the nursery, but the prefetch version is notably faster.
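The actual test case isn't reproduced here, but a minimal sketch of the same shape might look like the following: sum an unboxed Int array while issuing a prefetch `dist` elements ahead of each read. All the names (`IntArray`, `mkArray`, `sumPrefetch`) are hypothetical, and it assumes GHC >= 7.10 (state-token prefetch primops) and 8-byte `Int`s.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.IO (IO(..))

-- Hypothetical wrapper around an immutable unboxed array of Ints.
data IntArray = IntArray ByteArray#

-- Allocate an array holding 0..n-1 (assumes 8-byte Int).
mkArray :: Int -> IO IntArray
mkArray (I# n#) = IO $ \s0 ->
  case newByteArray# (n# *# 8#) s0 of
    (# s1, mba# #) ->
      let fill i# s
            | isTrue# (i# >=# n#) = s
            | otherwise = fill (i# +# 1#) (writeIntArray# mba# i# i# s)
      in case unsafeFreezeByteArray# mba# (fill 0# s1) of
           (# s2, ba# #) -> (# s2, IntArray ba# #)

-- Sum the array, prefetching `dist` elements ahead of each read.
-- dist = 0 prefetches the location we're just about to look at;
-- dist = 10 mirrors the "10 lookups ahead" variant.  Prefetch is
-- only a hint, so the out-of-range byte offsets near the end of
-- the array are harmless.
sumPrefetch :: Int -> Int -> IntArray -> IO Int
sumPrefetch (I# dist#) (I# n#) (IntArray ba#) = IO $ \s0 ->
  let go i# acc# s
        | isTrue# (i# >=# n#) = (# s, I# acc# #)
        | otherwise =
            let s' = prefetchByteArray0# ba# ((i# +# dist#) *# 8#) s
            in go (i# +# 1#) (acc# +# indexIntArray# ba# i#) s'
  in go 0# 0# s0

main :: IO ()
main = do
  arr  <- mkArray 1000
  near <- sumPrefetch 0 1000 arr
  far  <- sumPrefetch 10 1000 arr
  print (near, far)  -- both sums are 499500; only the timing differs
```

A real benchmark would of course use a much larger array (so the working set doesn't fit in cache) and a random access pattern, but the shape of the loop is the same.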
We were just using this to make sure the implementation worked, so we didn't put too much effort into it. Originally we were working with binary search and speculative look-ahead, but that was far harder to maintain as the prefetch ops changed. (A bundled binary search is a lot easier, of course.) Sadly, the early binary-search results are of no use: in their early form, using the prefetch ops from pure code created massive nursery garbage pressure. We had one (very briefly) working in IO, I believe, but I lack any information on it.
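"Bundled" isn't pinned down above; one plausible reading is several searches run in lockstep, with every bundle member's next midpoint prefetched before any midpoint is actually read, so the cache misses overlap. A hedged sketch of that reading, with hypothetical names (`IntArray`, `bundledSearch`) and the same GHC >= 7.10 / 8-byte-`Int` assumptions:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.IO (IO(..))

data IntArray = IntArray ByteArray#

-- Sorted array holding 2*i at index i (assumes 8-byte Int).
mkSorted :: Int -> IO IntArray
mkSorted (I# n#) = IO $ \s0 ->
  case newByteArray# (n# *# 8#) s0 of
    (# s1, mba# #) ->
      let fill i# s
            | isTrue# (i# >=# n#) = s
            | otherwise = fill (i# +# 1#) (writeIntArray# mba# i# (2# *# i#) s)
      in case unsafeFreezeByteArray# mba# (fill 0# s1) of
           (# s2, ba# #) -> (# s2, IntArray ba# #)

-- Prefetch is only a hint: no bounds check, never faults.
prefetchElem :: IntArray -> Int -> IO ()
prefetchElem (IntArray ba#) (I# i#) =
  IO $ \s -> (# prefetchByteArray0# ba# (i# *# 8#) s, () #)

readElem :: IntArray -> Int -> Int
readElem (IntArray ba#) (I# i#) = I# (indexIntArray# ba# i#)

-- One in-flight search: a result, or a key with its live (lo, hi) range.
data Search = Done Int | Going Int Int Int

-- Run a bundle of searches in lockstep: each round, prefetch every
-- member's midpoint first, then read and narrow each one.
bundledSearch :: IntArray -> Int -> [Int] -> IO [Int]
bundledSearch arr n keys = loop [Going k 0 (n - 1) | k <- keys]
  where
    loop ss
      | all finished ss = pure [i | Done i <- ss]
      | otherwise = do
          mapM_ pf ss               -- issue all prefetches up front
          loop (map advance ss)     -- then do the demanding reads
    finished (Done _) = True
    finished _        = False
    pf (Going _ lo hi) | lo <= hi = prefetchElem arr ((lo + hi) `div` 2)
    pf _ = pure ()
    advance s@(Done _) = s
    advance (Going k lo hi)
      | lo > hi   = Done (-1)      -- key absent
      | v == k    = Done m
      | v < k     = Going k (m + 1) hi
      | otherwise = Going k lo (m - 1)
      where m = (lo + hi) `div` 2
            v = readElem arr m

main :: IO ()
main = do
  arr <- mkSorted 1000
  print =<< bundledSearch arr 1000 [0, 10, 999, 1998]
  -- prints [0,5,-1,999]
```

The maintainability point above then falls out naturally: the prefetch lives in exactly one place (`pf`), so when the primop's type changed, only that one line had to follow.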
Yes - using prefetch well is hard. Yes - the optimal use of prefetch depends on the exact CPU and memory you have.
The best way to deal with this is of course to plan for it, and that can be worthwhile for some code. Code that does a lot of random access but doesn't use too much memory bandwidth can be fairly tolerant of the prefetch distance, though. For example, the demo should drop to near-optimal quickly, and its performance should take quite a while to start degrading again. I believe on my system the "nearly optimal" range ran from around 5 until around 40.
One nice variant of prefetch that I talked to Carter about was "stick the prefetch approximately the right number of instructions ahead of here". I expect that is too complicated to implement, though.
As we don't have thunk prefetchers currently (sad), I can't give you a demo showing a speed-up in the containers package, but I could write the bundled binary-search demo for you if you like.