
Daniel Fischer
On Tuesday 29 December 2009 14:34:03, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x faster than the Priority Queue based code from Melissa O'Neill's ZIP package mentioned on the haskellwiki/Prime_Numbers page, with about half the memory use reported, when producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important?
I thought the uppercase FASTER in the subject meant you were really interested in speed. If you're only interested in asymptotics, interpreted may be appropriate.
However, it is possible that optimisation can change the perceived asymptotics of an algorithm (determined strictness eliminating thunks for example).
While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code
That was what I was getting at first too, before I put the _type_signatures_ and the "specialize" _pragmas_ into my code, as per her file. Then it was only 1.3x slower when compiled (with about the same asymptotics and memory usage). A sketch of what those annotations look like is just below.
and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.
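To show the shape of those annotations - a minimal sketch on a made-up helper (not the actual contents of either file): an overloaded signature plus SPECIALIZE pragmas for the concrete types actually used, so the compiled code isn't stuck passing Integral dictionaries around:

-- hypothetical stand-in; in the real files the pragmas go on the sieve functions
multiplesOf :: Integral a => a -> [a]
{-# SPECIALIZE multiplesOf :: Int -> [Int] #-}
{-# SPECIALIZE multiplesOf :: Integer -> [Integer] #-}
multiplesOf p = [p*p, p*p + 2*p ..]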
See, this kind of inconsistency is exactly why I was concentrating on only one platform when measuring speed - the interpreted/GHCi combination. Especially when developing and trying out several approaches, testing with the compiler just takes too long. :) And why should it give (sometimes) wildly different readings when running inside GHCi versus standalone??
And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Here's what I'm using (BTW I've put it on the primes haskellwiki page too). The memory reported for interpreted is about half of PQ's (IIRC), and compiled - the same:

minus a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x : xs `minus` b
       GT ->      a `minus` ys
       EQ ->     xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x : merge xs b
       EQ -> x : merge xs ys
       GT -> y : merge a  ys
merge a b = if null b then a else b
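A quick sanity check in GHCi (not from the original message):

> take 10 ([1..] `minus` [2,4..])
[1,3,5,7,9,11,13,15,17,19]
> take 10 (merge [1,4..] [2,5..])
[1,2,4,5,7,8,10,11,13,14]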
Would that not tell us more about the compiler performance than the code itself?
Unless you write machine code or assembly, don't all performance tests tell us
more about the compiler/interpreter performance than the code itself?
That is, of course, with respect to algorithms with the same scaling behaviour.
This code is just the endpoint (so far) of a short progression of natural, stepwise developments of the famous classic Turner's sieve,
That was
sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)
, was it?
right
through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
Good.
It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.
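The usual illustration of that point - not competitive with an in-place sort, just readable at a glance:

qsort :: Ord a => [a] -> [a]
qsort []     = []
qsort (x:xs) = qsort [a | a <- xs, a < x] ++ [x] ++ qsort [a | a <- xs, a >= x]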
What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category.
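For illustration, a minimal sketch of that sort of STUArray sieve (not actual code from this thread):

import Control.Monad (forM_, when)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- cross off multiples in place, in a mutable unboxed array
primesUpTo :: Int -> [Int]
primesUpTo n = [i | (i, True) <- assocs sieved]
  where
    sieved :: UArray Int Bool
    sieved = runSTUArray $ do
      a <- newArray (2, n) True
      forM_ [2 .. floor (sqrt (fromIntegral n :: Double))] $ \p -> do
        pIsPrime <- readArray a p
        when pIsPrime $
          forM_ [p*p, p*p + p .. n] $ \m -> writeArray a m False
      return a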
It's just that the mutating code tends to be convoluted, like in the example I mentioned of quicksort. One has to read the C code with good attention to understand it. "Normal" Haskell is much more visually apparent, like

primes = 2: 3: sieve (tail primes) [5,7..]
 where
  sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
    where (h,~(_:t)) = span (< q) xs
          q          = p*p

or

primes = 2: 3: sieve [] (tail primes) 5
 where
  sieve fs (p:ps) x = [i | i <- [x,x+2..q-2], a!i]
                      ++ sieve ((2*p,q):fs') ps (q+2)
    where
     q     = p*p
     mults = [ [y+s,y+2*s..q] | (s,y) <- fs]
     fs'   = [ (s,last ms) | ((s,_),ms) <- zip fs mults]
     a     = accumArray (\a b -> False) True (x,q-2)
                        [(i,()) | ms <- mults, i <- ms]
Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow,
I fully agree.
:) :) :) :) :)
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.

Exactly! Seeing this claim was just incredible to me. I spent a considerable amount of time when I first learned Haskell tweaking the SICP code (as I remembered it; probably very similar to Turner's) until coming up with an equivalent of the "postponed sieve" (some years ago; I didn't know about "span" yet :) ). But I assumed that this result was well known. Turner's sieve should long have been regarded as a _specification_, not actual _code_, I thought. I think what happened was that Melissa O'Neill had the mutable-storage, i.e. "imperative", implementation in mind when she said that, where numbers do get "crossed off" from the same "canvas". But here in functional code we don't "cross off" any numbers; we deal with a supply of numbers, with filtering and merging, and with nested function calls and their overhead etc., whose costs can't just be ignored. IOW, there is no "crossing off" done by any of the extra filters, which are nevertheless all VERY busy doing nothing - _not_ "crossing" the multiples "off".
what with the postponed filters (which serves as the framework for all the other variants) achieving an orders-of-magnitude speedup and cutting Turner's O(n^2) right down to O(n^1.5), just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). Euler's sieve being a special case of the sieve of Eratosthenes, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.
I never found that claim convincing either.
I think what probably got conflated was "faithful to the original algorithm" with "faithful to its typical imperative mutable-storage implementation", as in "having the same _complexity_". In that sense, of course, linear merging is worse; it has worse complexity than the C version, but it is nevertheless faithful to the original algorithm, only in a functional setting. It is worse because of the linear nature of lists, and it is all too easy to overlook the possibility of tree folding and jump to the conclusion that one needs a specialized data structure for that... But the article didn't even get to that part; instead it was all about proving rigorously that divisibility testing for primes is very costly (without actually formulating that conclusion). It was frustrating to read that "the details of what gets crossed off, and how, matter" without these details actually being spelled out - simply, that primes shouldn't be tested for divisibility at all. That's the real insight of the article, IMO.
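A rough sketch of that tree folding (not the exact posted code, again reusing the `minus` and `merge` from earlier in this message): instead of a linear right fold, the streams of multiples are merged pairwise into a tree, so reaching a given composite costs roughly the depth of the tree rather than a walk through all the preceding filters.

-- `minus` and `merge` as defined above
primesTF :: [Integer]
primesTF = 2 : 3 : ([5,7..] `minus` composites)
  where
    composites = tfold [ [p*p, p*p + 2*p ..] | p <- tail primesTF ]
    -- emit the first stream's head unconditionally to stay productive,
    -- and fold the rest pairwise
    tfold ((x:xs):t) = x : merge xs (tfold (pairs t))
    pairs (a:b:t)    = merge a b : pairs t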
Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as the reference point for the basic _code_, precisely because it _does_ use the primes-squares optimization, which _is_ essential to any basic sieve.
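For completeness, a sketch of that reference point (essentially the postponed filters variant, not a verbatim quote of the wiki code): each prime's filter is started only once its square shows up among the candidates.

primesPF :: [Integer]
primesPF = 2 : 3 : sieve (tail primesPF) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs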
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe <at> haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe