
Daniel Fischer
On Tuesday 29 December 2009 14:34:03, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x faster than the Priority Queue based code from Melissa O'Neill's ZIP package mentioned on the haskellwiki/Prime_Numbers page, with about half the memory use reported, when producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important?
I thought the uppercase FASTER in the subject meant you were really interested in speed. If you're only interested in asymptotics, interpreted may be appropriate.
However, it is possible that optimisation can change the perceived asymptotics of an algorithm (determined strictness eliminating thunks for example).
While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code
That was what I was getting at first too, before I put the _type_signatures_ and the "specialize" _pragmas_ into my code, as per her file. Then it was only 1.3x slower when compiled (with about the same asymptotics and memory usage). A sketch of what those annotations look like is just below.
and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.
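To show the shape of those annotations - a minimal sketch on a made-up helper (not the actual contents of either file): an overloaded signature plus SPECIALIZE pragmas for the concrete types actually used, so the compiled code isn't stuck passing Integral dictionaries around:

-- hypothetical stand-in; in the real files the pragmas go on the sieve functions
multiplesOf :: Integral a => a -> [a]
{-# SPECIALIZE multiplesOf :: Int -> [Int] #-}
{-# SPECIALIZE multiplesOf :: Integer -> [Integer] #-}
multiplesOf p = [p*p, p*p + 2*p ..]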
See, this kind of inconsistency is exactly why I was concentrating on only one platform when measuring speed - the interpreted/GHCi combination. Especially when developing and trying out several approaches, testing with the compiler just takes too long. :) And why should it give (sometimes) wildly different readings when running inside GHCi versus standalone??
And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Here's what I'm using (BTW I've put it on the primes haskellwiki page too). The memory reported for interpreted is about half of PQ's (IIRC), and compiled - the same:

minus a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x : xs `minus` b
       GT ->      a `minus` ys
       EQ ->     xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x : merge xs b
       EQ -> x : merge xs ys
       GT -> y : merge a  ys
merge a b = if null b then a else b
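A quick sanity check in GHCi (not from the original message):

> take 10 ([1..] `minus` [2,4..])
[1,3,5,7,9,11,13,15,17,19]
> take 10 (merge [1,4..] [2,5..])
[1,2,4,5,7,8,10,11,13,14]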
Would that not tell us more about the compiler performance than the code itself?
Unless you write machine code or assembly, don't all performance tests tell us
more about the compiler/interpreter performance than the code itself?
That is, of course, with respect to algorithms with the same scaling behaviour.
This code is just the endpoint (so far) of a short progression of natural, stepwise developments of the famous classic Turner's sieve,
That was
sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)
, was it?
right
through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
Good.
It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.
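The usual illustration of that point - not competitive with an in-place sort, just readable at a glance:

qsort :: Ord a => [a] -> [a]
qsort []     = []
qsort (x:xs) = qsort [a | a <- xs, a < x] ++ [x] ++ qsort [a | a <- xs, a >= x]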
What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category.
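For illustration, a minimal sketch of that sort of STUArray sieve (not actual code from this thread):

import Control.Monad (forM_, when)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- cross off multiples in place, in a mutable unboxed array
primesUpTo :: Int -> [Int]
primesUpTo n = [i | (i, True) <- assocs sieved]
  where
    sieved :: UArray Int Bool
    sieved = runSTUArray $ do
      a <- newArray (2, n) True
      forM_ [2 .. floor (sqrt (fromIntegral n :: Double))] $ \p -> do
        pIsPrime <- readArray a p
        when pIsPrime $
          forM_ [p*p, p*p + p .. n] $ \m -> writeArray a m False
      return a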
It's just that the mutating code tends to be convoluted, like in the example I mentioned of quicksort. One has to read the C code with good attention to understand it. "Normal" Haskell is much more visually apparent, like

primes = 2: 3: sieve (tail primes) [5,7..]
 where
  sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
    where (h,~(_:t)) = span (< q) xs
          q          = p*p

or

primes = 2: 3: sieve [] (tail primes) 5
 where
  sieve fs (p:ps) x = [i | i <- [x,x+2..q-2], a!i]
                      ++ sieve ((2*p,q):fs') ps (q+2)
    where
     q     = p*p
     mults = [ [y+s,y+2*s..q] | (s,y) <- fs]
     fs'   = [ (s,last ms) | ((s,_),ms) <- zip fs mults]
     a     = accumArray (\a b -> False) True (x,q-2)
                        [(i,()) | ms <- mults, i <- ms]
Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow,
I fully agree.
:) :) :) :) :)
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.

Exactly! Seeing this claim was just incredible to me. I spent a considerable amount of time when I first learned Haskell tweaking the SICP code (as I remembered it; probably very similar to Turner's) until coming up with an equivalent of the "postponed sieve" (some years ago; I didn't know about "span" yet :) ). But I assumed that this result was well known. Turner's sieve should long have been regarded as a _specification_, not actual _code_, I thought. I think what happened was that Melissa O'Neill had the mutable-storage, i.e. "imperative", implementation in mind when she said that, where numbers do get "crossed off" from the same "canvas". But here in functional code we don't "cross off" any numbers; we deal with a supply of numbers, with filtering and merging, and with nested function calls and their overhead etc., whose costs can't just be ignored. IOW, there is no "crossing off" done by any of the extra filters, which are nevertheless all VERY busy doing nothing - _not_ "crossing" the multiples "off".
what with the postponed filters (which serves as the framework for all the other variants) achieving an orders-of-magnitude speedup and cutting Turner's O(n^2) right down to O(n^1.5), just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). Euler's sieve being a special case of the sieve of Eratosthenes, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.
I never found that claim convincing either.
I think what probably got conflated was "faithful to the original algorithm" with "faithful to its typical imperative mutable-storage implementation", as in "having the same _complexity_". In that sense, of course, linear merging is worse; it has worse complexity than the C version, but it is nevertheless faithful to the original algorithm, only in a functional setting. It is worse because of the linear nature of lists, and it is all too easy to overlook the possibility of tree folding and jump to the conclusion that one needs a specialized data structure for that... But the article didn't even get to that part; instead it was all about proving rigorously that divisibility testing for primes is very costly (without actually formulating that conclusion). It was frustrating to read that "the details of what gets crossed off, and how, matter" without these details actually being spelled out - simply, that primes shouldn't be tested for divisibility at all. That's the real insight of the article, IMO.
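A rough sketch of that tree folding (not the exact posted code, again reusing the `minus` and `merge` from earlier in this message): instead of a linear right fold, the streams of multiples are merged pairwise into a tree, so reaching a given composite costs roughly the depth of the tree rather than a walk through all the preceding filters.

-- `minus` and `merge` as defined above
primesTF :: [Integer]
primesTF = 2 : 3 : ([5,7..] `minus` composites)
  where
    composites = tfold [ [p*p, p*p + 2*p ..] | p <- tail primesTF ]
    -- emit the first stream's head unconditionally to stay productive,
    -- and fold the rest pairwise
    tfold ((x:xs):t) = x : merge xs (tfold (pairs t))
    pairs (a:b:t)    = merge a b : pairs t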
Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as the reference point for the basic _code_, precisely because it _does_ use the primes-squares optimization, which _is_ essential to any basic sieve.
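For completeness, a sketch of that reference point (essentially the postponed filters variant, not a verbatim quote of the wiki code): each prime's filter is started only once its square shows up among the candidates.

primesPF :: [Integer]
primesPF = 2 : 3 : sieve (tail primesPF) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs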
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe <at> haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe