
Daniel Fischer writes:
On Wednesday 30 December 2009 20:46:57, Will Ness wrote:
Daniel Fischer writes:
On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is
  primes = sieve [2..]
    where
      sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
and the Postponed Filters is
  primes = 2 : 3 : sieve (tail primes) [5,7..]
    where
      sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
        where
          (h, ~(_:t)) = span (< p*p) xs
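(Either definition can be tried directly in GHCi; for example:

  ghci> take 10 primes
  [2,3,5,7,11,13,17,19,23,29]
)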
Are you saying they both exhibit the same complexity?
No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out the multiples of p with 2*p, you have

  sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]

crossings-out, that is Theta(bound * log (log bound)). If you eliminate multiples of some small primes a priori (wheel), you can reduce the constant factor significantly, but the complexity remains the same (you drop a few terms from the front of the sum and multiply the remaining terms by phi(n)/n, where n is the product of the excluded primes).

The key question then is _*WHEN*_, not _*WHAT*_. As is clearly demonstrated by the case of Turner's sieve vs. the Postponed Filters, the work that is done (of crossing numbers off) is the same - _when_ it is actually done - but Turner's starts out so _prematurely_ that it is busy doing nothing most of the time. Thus its function-call overhead costs pile up enormously, overshadowing the actual calculation. So analyzing that calculation in the premature-execution setting misses the point, although it becomes helpful after we fix this with the Postponed Filters. _Only then_ can the finer points of this algorithm's analysis be applied - namely, avoiding divisibility testing by primes altogether. And _if_ a fast, cheap primality test existed, the filtering versions would win out, because they progressively cull the input sequence, so there would be no double hits as we have when merging the multiples (whether from lists or inside the PQ).
If you start crossing out at p^2, the number is
  sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I don't remember how - believed to cancel out the main term. It doesn't, it's O(bound/log bound), so the complexity is the same.
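Both counts are easy to check concretely. Here is a rough, self-contained sketch (my own illustration, not code from this thread): a small mutable-array sieve whose per-prime starting point can be chosen (2*p or p*p), together with the two sums from above, so the operation counts and their difference can be compared for a given bound.

  import Control.Monad (forM_, when)
  import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
  import Data.Array.Unboxed (UArray, assocs)

  -- mutable-array sieve; `startAt` picks where crossing-out begins for
  -- each prime p, e.g. (\p -> 2*p) or (\p -> p*p)
  sieveFrom :: (Int -> Int) -> Int -> [Int]
  sieveFrom startAt bound = [i | (i, False) <- assocs composites]
    where
      composites :: UArray Int Bool
      composites = runSTUArray $ do
        arr <- newArray (2, bound) False
        forM_ (takeWhile (\p -> p*p <= bound) [2..]) $ \p -> do
          isComp <- readArray arr p
          when (not isComp) $
            forM_ [startAt p, startAt p + p .. bound] $ \m ->
              writeArray arr m True
        return arr

  -- the two crossing-out counts discussed above
  crossFrom2p, crossFromP2 :: Int -> Int
  crossFrom2p bound = sum [bound `div` p - 1     | p <- primesToSqrt bound]
  crossFromP2 bound = sum [bound `div` p - (p-1) | p <- primesToSqrt bound]

  primesToSqrt :: Int -> [Int]
  primesToSqrt bound = takeWhile (\p -> p*p <= bound) (sieveFrom (\p -> p*p) bound)

Their difference, crossFrom2p b - crossFromP2 b, is exactly sum [p - 2 | p <- primesToSqrt b] - essentially the sum of the primes up to sqrt b, which grows like b/log b and so never catches up with the Theta(b * log (log b)) main term.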
Now if you take a stream of numbers from which you remove composites, having a priority queue of multiples of primes, things are a little different. If you start crossing out at 2*p, when you are looking at n, you have more multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I think this would have complexity O(bound*log bound*log (log bound)).
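For reference, here is a minimal sketch of that kind of incremental sieve (my own illustration, using Data.Map as a stand-in for the priority queue, and working on odds only): each prime p enters the table only when p*p is reached, so while examining n the table holds about pi(sqrt n) entries, in line with the comparison above.

  import qualified Data.Map as Map

  primesMap :: [Integer]
  primesMap = 2 : sieve [3,5..] Map.empty
    where
      -- the table maps each upcoming composite to the step(s) 2*p of the
      -- prime(s) p that will cross it off
      sieve (x:xs) table = case Map.lookup x table of
        Nothing    -> x : sieve xs (Map.insert (x*x) [2*x] table)
        Just steps -> sieve xs
            (foldl (\t step -> Map.insertWith (++) (x + step) [step] t)
                   (Map.delete x table)
                   steps)
      sieve [] _ = []

Changing the insert key from x*x to 3*x (the first odd multiple past x) gives the early-start variant: still correct, but the table then holds on the order of pi(n) entries instead of pi(sqrt n).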
There are two questions here - where to start crossing off numbers, and when. If you'd start at 2*p, maybe the overall complexity would remain the same, but it would add enormous overhead with all those duplicate multiples. No - the question is not where to start, but when. The PQ might hide the problem until the memory blows up. Anything we add that has no chance of contributing to the final result is added for nothing and only drives the total cost up needlessly.
I was under the impression that the first shows approximately O(n^2), and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each prime only with its multiples (plus the next composite in the PQ version). In Turner's algorithm, we match each prime with all smaller primes (and each composite with all primes <= its smallest prime factor, but that's less work than the primes). That indeed gives a complexity of O(pi(bound)^2).

What _is_ different is divisibility testing vs. composites removal, which follows from her in-depth analysis, although it is never quite formulated in such words in her article. But nothing matters until the premature starting-up is eliminated, and that key observation is missing from the article too - worse, it is brushed off with the casual remark that it is "a pleasing but minor optimization". That remark, as you show here, is true in the imperative, mutable-storage setting, but it is made in an article about functional code, in regard to the functional code of Turner's sieve. So the key understanding is overlooked.

Her article even adds the primes into the PQ prematurely itself, as soon as each prime is discovered (she fixes that in her ZIP package). With the PQ keeping these prematurely added entries deep inside its guts, the problem might not manifest itself immediately, but the memory blow-up would eventually hit the wall (the PQ ends up holding all the preceding primes, instead of just those below the square root of the limit - the entries above the square root not contributing to the calculation at all). And what we're after here is the insight anyway; skipping steps of natural development does not help in garnering an insight.

The most prominent problems with Turner's /code/ _specification_ are the premature start-ups and the divisibility testing by primes (as Melissa O'Neill's analysis shows). Fix one, and you get the Postponed Filters (which should really be used as the base reference point for all the rest). Fix the other, and you've got Euler's sieve:

  primes = 2 : 3 : sieve (tail primes) [5,7..]
    where
      sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [p*p, p*p+2*p..])
        where
          (h, ~(_:t)) = span (< p*p) xs

Clear, succinct, and plenty efficient for introductory textbook functional lazy code, of the same order of magnitude of performance as the PQ code on odds only. Now, comparing the PQ performance against _that_ would only be fair, and IMO would only add to its value - it is faster, has better asymptotics, and is greatly amenable to the wheel optimization right away.
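(The `minus` above is the usual ordered-list difference. It isn't defined anywhere in this message, so here is a minimal sketch of the standard definition, for completeness:)

  -- ordered-list difference: both arguments are assumed increasing;
  -- keeps the elements of the first list that are not in the second
  minus :: Ord a => [a] -> [a] -> [a]
  minus xs@(x:xt) ys@(y:yt)
    | x < y     = x : minus xt ys
    | x == y    =     minus xt yt
    | otherwise =     minus xs yt
  minus xs _    = xs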
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
Empirically, it was always _below_ 1.5 and above 1.4, and for the PQ/merged multiples removal it was around 1.25..1.17.
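For what it's worth, one way to obtain such empirical orders of growth (a sketch of mine, not code from this thread): time the production of n primes for two values of n in separate runs and take the base-2 log of the time ratio. The `primes` below just repeats the Postponed Filters definition so the file compiles on its own.

  import System.Environment (getArgs)
  import Data.Time.Clock (getCurrentTime, diffUTCTime)

  -- the list under test; here, the Postponed Filters from above
  primes :: [Integer]
  primes = 2 : 3 : sieve (tail primes) [5,7..]
    where
      sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
        where (h, ~(_:t)) = span (< p*p) xs

  main :: IO ()
  main = do
    [n] <- fmap (map read) getArgs
    t0  <- getCurrentTime
    print (primes !! (n - 1))     -- force the first n primes
    t1  <- getCurrentTime
    print (diffUTCTime t1 t0)

Running it for n and for 2*n (in separate invocations, so no work is shared) and taking logBase 2 (t2/t1) gives the local order of growth in the number of primes produced; figures like those quoted above can be read that way.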