This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvements or regressions caused by certain low-level effects, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
transformations.
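A minimal sketch of what such an instruction-count measurement could look like, assuming valgrind's cachegrind tool and its "I refs" summary line on stderr. The helper names (parse_instruction_count, count_instructions, relative_change) are made up for illustration and are not the actual benchmarking setup described above:

```python
import re
import subprocess

# Matches the summary line cachegrind prints to stderr, e.g.
#   ==12345== I   refs:      1,234,567
# The "==PID==" prefix keeps us from matching the program's own output.
I_REFS = re.compile(r"^==\d+==\s+I\s+refs:\s+([\d,]+)", re.MULTILINE)


def parse_instruction_count(cachegrind_stderr):
    """Extract the dynamic instruction count from cachegrind's summary."""
    match = I_REFS.search(cachegrind_stderr)
    if match is None:
        raise ValueError("no 'I refs' line found in cachegrind output")
    return int(match.group(1).replace(",", ""))


def count_instructions(cmd):
    """Run cmd (a list of strings) under cachegrind; return instructions executed."""
    result = subprocess.run(
        ["valgrind", "--tool=cachegrind",
         "--cachegrind-out-file=/dev/null", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    return parse_instruction_count(result.stderr)


def relative_change(baseline, current):
    """Fractional change relative to the baseline (0.04 means +4%)."""
    return (current - baseline) / baseline
```

Comparing relative_change(old, new) against a small threshold then gives a far more stable warning signal than wall-clock time does, at the cost of being blind to cache and alignment effects.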
  4% is far from being "big", look e.g. at
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
where changing just the alignment of the code led to a 10% difference.
:-/ The code itself or its layout wasn't changed at all. The "Producing
Wrong Data Without Doing Anything Obviously Wrong!" paper gives more
funny examples.
  I'm not saying that code layout has no impact, quite the opposite.
The main point is: Do we really have benchmarking machinery in place
which can tell you if you've improved the real run time or made it
worse? I doubt that, at least at the scale of a few percent. To reach
just that simple yes/no conclusion, you would need quite heavy machinery
involving randomized linking order, varying environments (in the sense
of "number and contents of environment variables"), various CPU models
etc. If you do not do that, modern HW will leave you with a lot of
"WTF?!" moments and wrong conclusions.
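To make one piece of that machinery concrete, here is a rough sketch of timing a command under randomly sized environments. It covers only the "varying environments" part, not link order or CPU models. The variable name BENCH_ENV_PADDING, the helper names, and the very crude non-overlap criterion are all assumptions made for illustration:

```python
import os
import random
import subprocess
import time

# Hypothetical dummy variable; only its length matters, since it shifts
# where the stack (and hence other data) ends up in the process.
PAD_VAR = "BENCH_ENV_PADDING"


def randomized_env(rng, base=None, max_padding=4096):
    """Copy the environment and add a padding variable of random length."""
    env = dict(os.environ if base is None else base)
    env[PAD_VAR] = "x" * rng.randrange(max_padding + 1)
    return env


def time_under_random_envs(cmd, runs=20, seed=0):
    """Wall-clock timings of cmd, each run under a different environment."""
    rng = random.Random(seed)
    timings = []
    for _ in range(runs):
        env = randomized_env(rng)
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def clearly_faster(a, b):
    """Deliberately conservative: True only if every run in a beats every run in b."""
    return max(a) < min(b)
```

With this criterion, two builds whose timing ranges overlap are simply reported as "cannot tell", which is likely the honest answer for differences of a few percent.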
You raise good points. While the example in the blog seems a bit
constructed, with the whole loop fitting in a cache line, the principle
is a real concern.