Simple FAST lazy functional primes

First, here it is:

  primes = 2: 3: sieve 0 primes' 5
  primes' = tail primes
  sieve k (p:ps) x = [x | x <- [x,x+2..p*p-2],
                          and [(x`rem`p)/=0 | p <- take k primes']]
                     ++ sieve (k+1) ps (p*p+2)

(thanks to Leon P. Smith for his brilliant idea of directly generating the spans of odds instead of calling 'span' on the infinite odds list).

This code is faster than a PriorityQueue-based sieve (granted, using my own ad-hoc implementation, so YMMV) in producing the first million primes (testing was done by running the ghc -O3 compiled code inside GHCi). The relative performance vs the PQ version is:

  primes produced:  100,000   300,000   1,000,000
  relative time:        0.6      0.75        0.97

I've recently come across Melissa O'Neill's article (I know, I know), and she makes the incredible claim there that the square-of-prime optimization doesn't count if we want to improve the old and "lazy"

  uprimes = 2: sieve [3,5..]
    where
      sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p > 0]

Her article gave me the strong impression that she claims the only way to fix this code is the Priority Queue based optimization, and she then goes on to present astronomical gains in speed by implementing it.

Well, I find this claim incredible. First of all, the "naive" code fires up its nested filters much too early, when in fact each must be started only when the prime's square is reached (not only have its work started there - be started there itself!), so the filters are delayed:

  dprimes = 2: 3: sieve (tail dprimes) [5,7..]
    where
      sieve (p:ps) xs = h ++ sieve ps (filter ((/=0).(`rem`p)) (tail t))
        where (h,t) = span (< p*p) xs

This code right there is on par with the PQ-based code, only x3-4 slower at generating the first million primes.

Second, the implicit control structure of linearly nested filters, piling up in front of the supply stream, each with its prime-number-it-filters-by hidden inside it, needs to be made explicit as a data structure of primes-to-filter-by (which is in fact a prefix of the very primes list we're building), so that the filtering can be done by one function call instead of many nested calls:

  xprimes = 2: 3: sieve 0 primes' [5,7..]
    where
      primes' = tail xprimes
      sieve k (p:ps) xs = noDivs k h ++ sieve (k+1) ps (tail t)
        where (h,t) = span (< p*p) xs
      noDivs k = filter (\x -> all ((/=0).(x`rem`)) (take k primes'))

From here, using Leon P. Smith's brilliant idea of fusing the span and the infinite odds supply, we arrive at the final version,

  kprimes = 2: 3: sieve 0 primes' 5
    where
      primes' = tail kprimes
      sieve k (p:ps) x = noDivs k h ++ sieve (k+1) ps (t+2)
        where (h,t) = ([x,x+2..t-2], p*p)
      noDivs k = filter (\x -> all ((/=0).(x`rem`)) (take k primes'))

Using the list comprehension syntax instead, it turns out, gives it another 5-10% speedup when compiled with ghc -O3.

So I take it to disprove the central premise of the article, and to show that simple lazy functional FAST primes code does in fact exist, and that the PQ optimization - whose value of course no one can dispute - is a far-end optimization.
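For readers who want to run this, here is the final version above wrapped into a minimal self-contained file with a small driver (an added sketch, not part of the original post; note there is no type signature yet - a later message in this thread adds one):

  -- Kprimes.hs : the final version from the post, plus a tiny driver
  kprimes = 2 : 3 : sieve 0 primes' 5
    where
      primes' = tail kprimes
      sieve k (p:ps) x = noDivs k h ++ sieve (k+1) ps (t+2)
        where (h, t) = ([x, x+2 .. t-2], p*p)
      noDivs k = filter (\n -> all ((/= 0) . (n `rem`)) (take k primes'))

  main :: IO ()
  main = print (kprimes !! (100000 - 1))   -- prints the 100,000th prime

Compile with ghc -O2 (or -O3, as in the post) and run; the (h, t) binding is lazy, so t = p*p is available to bound the span [x, x+2 .. t-2] it sits next to.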

You can remove the "take k" step by passing along the list of primes smaller than p instead of k:

  primes = 2 : 3 : 5 : 7 : sieve [3, 2] (drop 2 primes)
  sieve qs@(q:_) (p:ps) =
      [x | x <- [q*q+2, q*q+4 .. p*p-2], and [(x`rem`p)/=0 | p <- qs]]
      ++ sieve (p:qs) ps

I also removed the "x" parameter by generating it from the previous prime. But this also means you have to start at p=5 and q=3, because you want q*q+2 to be odd.

Sjoerd

On Nov 2, 2009, at 8:41 AM, Will Ness wrote:

[...]
-- Sjoerd Visscher sjoerd@w3future.com

Excuse me, 2 doesn't have to be in the list of smaller primes, as we're only generating odd numbers:

  primes = 2 : 3 : 5 : 7 : sieve [3] (drop 2 primes)
  sieve qs@(q:_) (p:ps) =
      [x | x <- [q*q+2, q*q+4 .. p*p-2], and [(x`rem`p)/=0 | p <- qs]]
      ++ sieve (p:qs) ps

Sjoerd

On Nov 2, 2009, at 2:07 PM, Sjoerd Visscher wrote:

[...]
-- Sjoerd Visscher sjoerd@w3future.com

Sjoerd Visscher writes:
[...] 2 doesn't have to be in the list of smaller primes, as we're only generating odd numbers:
  primes = 2 : 3 : 5 : 7 : sieve [3] (drop 2 primes)
  sieve qs@(q:_) (p:ps) =
      [x | x <- [q*q+2, q*q+4 .. p*p-2], and [(x`rem`p)/=0 | p <- qs]]
      ++ sieve (p:qs) ps
Sjoerd
Thanks! I haven't tested your code's speed yet, but I have a few points:

First, I wouldn't expect eliminating a parameter to have any effect on performance in compiled code (I haven't a clue whether -O2 or -O3 should be used, BTW) if it doesn't eliminate any superfluous computations.

Second, you prepend bigger primes in front of the list (building the prefix in reverse order); this will cause them to be tested against first. I think (1) we're supposed to test composites against smaller factors first, for the tests to stop sooner - IMO this will slow the code down - and (2) you can safely append them at the end instead, for a compiler should be able to just maintain a tail pointer there. But then you lose immediate access to the last found prime, so you will need to bring the 'x' parameter back. Then you end up with my code exactly :) (only replicating the prefix):

  primes = 2 : 3 : 5 : 7 : sieve [3] (drop 2 primes) 9
  sieve pfx (p:ps) x =
      [x | x <- [x+2, x+4 .. p*p-2], and [(x`rem`p)/=0 | p <- pfx]]
      ++ sieve (pfx++[p]) ps (p*p)

It'll be interesting to test my hypotheses (two of them :) ) and see if this in fact has better time than your code.

Thirdly, (take k) could be compiled out completely into a tail pointer too, maintained and advanced after each step. I would hope a smart compiler does just that. Again, some more testing is in order - although I tested the two approaches on some previous incarnation of this code, and (take k) vs (pfx++[p]) had exactly the same run time (or was better).

What I'm after mostly here is for the code to be clear and simple, and reasonably efficient. Right now it corresponds very nicely to the description of the sieve as filtering all the odds testable by a known prefix of primes, then going on to the next batch of them with a prefix that has grown by one more prime, and so on.

But more importantly I want it to be known that there's a lot that can be done here, in a natural functional lazy kind of way, before resorting to priority queues and mutable arrays. We could always just use C too. ;)

Thanks for the feedback! :)
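An added aside, not from the thread: before timing such variants against each other, a trivial helper can sanity-check that two of the definitions above actually agree on their output:

  -- do two prime lists agree on their first n elements?
  agree :: Int -> [Int] -> [Int] -> Bool
  agree n ps qs = take n ps == take n qs

  -- and, if not, where do they first differ? (index, value in each list)
  firstDiff :: Int -> [Int] -> [Int] -> Maybe (Int, Int, Int)
  firstDiff n ps qs =
      case [ (i, a, b) | (i, a, b) <- zip3 [0..] (take n ps) (take n qs), a /= b ] of
        (d : _) -> Just d
        []      -> Nothing

With any two of the thread's definitions in scope, e.g. agree 100000 primes kprimes can be run directly in GHCi.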

Will Ness writes:
But more importantly I want it to be known that there's a lot that can be done here, in a natural functional lazy kind of way, before resorting to priority queues and mutable arrays. We could always just use C too. ;)
I mean it as introductory code that's nevertheless good for producing the first million primes or so - not the best sieve ever, but the best (is it?) simple, clear, functional introductory sieve instead. That's another reason to use (take k primes') - it practically says it in English: "the first k odd primes". :)

Sjoerd Visscher writes:
Excuse me, 2 doesn't have to be in the list of smaller primes, as we're only generating odd numbers:
  primes = 2 : 3 : 5 : 7 : sieve [3] (drop 2 primes)
  sieve qs@(q:_) (p:ps) =
      [x | x <- [q*q+2, q*q+4 .. p*p-2], and [(x`rem`p)/=0 | p <- qs]]
      ++ sieve (p:qs) ps
Sjoerd
Hi,

I've run it now. In producing 100,000 primes, your above code takes x3.5 more time than the one I posted. The code modified as I suggested with (qs++[p]) takes exactly the same time as mine. :)

Cheers,
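An added sketch, not the benchmark actually used in the thread: a rough CPU-time harness for this kind of comparison, measuring how long it takes to force the n-th prime of a given list:

  import System.CPUTime (getCPUTime)
  import Control.Exception (evaluate)

  -- CPU seconds needed to reach the n-th element of a primes list
  timeNthPrime :: [Int] -> Int -> IO Double
  timeNthPrime ps n = do
      t0 <- getCPUTime
      _  <- evaluate (ps !! (n - 1))            -- forces the list up to n
      t1 <- getCPUTime
      return (fromIntegral (t1 - t0) / 1.0e12)  -- picoseconds -> seconds

With the thread's definitions in scope one might compare, say, timeNthPrime primes 100000 against another variant - though for anything serious the code should be compiled with optimisation rather than interpreted.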

On Nov 2, 2009, at 5:11 PM, Will Ness wrote:
Hi,
I've run it now. In producing 100,000 primes, your above code takes x3.5 more time than the one I posted.
Too bad. The order of the primes when filtering matters quite a lot!
The code modified as I suggested with (qs++[p]) takes exactly the same time as mine. :)
Yes, append and take are both O(n).

Here's another variation which I rather like:

  primes = 2 : 3 : sieve (tail primes) (notDivides 3) 5
  sieve (p:ps) test from =
      [x | x <- [from, from + 2 .. p * p - 2], test x]
      ++ sieve ps (\x -> test x && notDivides p x) (p * p + 2)
  notDivides d x = (x `rem` d) /= 0

I'm curious if GHC can compile this to the same speed as your code.

-- Sjoerd Visscher sjoerd@w3future.com

Sjoerd Visscher writes:
On Nov 2, 2009, at 5:11 PM, Will Ness wrote:
Hi,
I've run it now. In producing 100,000 primes, your above code takes x3.5 more time than the one I posted.
Too bad. The order of the primes when filtering matters quite a lot!
Yep. Most of the composites will have mostly small prime factors, most of the time. :)

That's an English paraphrase of the analysis from the paper, where she proves that whatever you do with the composites doesn't matter (in the _long_ run!) if you test primes for divisibility - because for the primes we end up testing against _all_ of the primes in the prefix. This is exactly where her PQ code achieves its coup - it gets its primes for free when it checks its composites and sees the gap there.

So the reason not to have these nested functions is to have all the logic represented in the open, in the data structure - here, still the list. PQ is a natural next step. With a lot of linearly nested functions, there isn't a lot we can do with them. :)
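An added illustration of this point (not from the thread): count how many trial divisions the prefix test performs for an odd n. A composite stops at its smallest prime factor, while a prime must be tried against the whole prefix of primes up to its square root:

  -- reference list of primes (essentially the thread's naive sieve)
  primesRef :: [Int]
  primesRef = 2 : sieve [3,5..]
    where sieve (p:xs) = p : sieve [x | x <- xs, x `rem` p /= 0]

  -- number of trial divisions the prefix test does for an odd n > 1
  testsFor :: Int -> Int
  testsFor n = go (tail primesRef)          -- odd primes 3, 5, 7, ...
    where
      go (p:ps)
        | p * p > n      = 0                -- prefix exhausted: n is prime
        | n `rem` p == 0 = 1                -- stopped early: n is composite
        | otherwise      = 1 + go ps

  -- e.g. testsFor 115 == 2 (stops at its factor 5), but testsFor 113 == 3
  -- (the prime 113 is tested against 3, 5 and 7 before we can tell)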
The code modified as I suggested with (qs++[p]) takes exactly the same time as mine. :)
Yes, append and take are both O(n).
I would hope for the 'append' here to be O(1), and for 'take' to be just translated into an upper limit of the testing loop in 'and'.
Here's another variation which I rather like:
  primes = 2 : 3 : sieve (tail primes) (notDivides 3) 5
  sieve (p:ps) test from =
      [x | x <- [from, from + 2 .. p * p - 2], test x]
      ++ sieve ps (\x -> test x && notDivides p x) (p * p + 2)
  notDivides d x = (x `rem` d) /= 0
I'm curious if GHC can compile this to the same speed as your code.
It depends on whether it is as good at traversing function call frames at run time as it is at traversing the elements of a list. Or whether it uses the call frames at all, or is able to compile all of it away. :)

Will Ness writes:
  primes = 2: 3: sieve 0 primes' 5
  primes' = tail primes
  sieve k (p:ps) x = [x | x <- [x,x+2..p*p-2],
                          and [(x`rem`p)/=0 | p <- take k primes']]
                     ++ sieve (k+1) ps (p*p+2)
(thanks to Leon P. Smith for his brilliant idea of directly generating the spans of odds instead of calling 'span' on the infinite odds list).
The relative performance vs the PQ-version is:
  primes produced:  100,000   300,000   1,000,000
  relative time:        0.6      0.75        0.97
One _crucial_ tidbit I've left out: _type_signature_. Adding (:: [Int]) speeds this code up more than TWICE! :) :)

'sieve' can also be used in e.g.

  primesFrom m = sieve (length h) ps m
    where (h, (_:ps)) = span (<= (floor.sqrt.fromIntegral) m) primes

to get a few big primes even faster. :)

On Mon, Nov 2, 2009 at 1:59 PM, Will Ness wrote:

Will Ness writes:

  primes = 2: 3: sieve 0 primes' 5
  primes' = tail primes
  sieve k (p:ps) x = [x | x <- [x,x+2..p*p-2],
                          and [(x`rem`p)/=0 | p <- take k primes']]
                     ++ sieve (k+1) ps (p*p+2)
(thanks to Leon P. Smith for his brilliant idea of directly generating the spans of odds instead of calling 'span' on the infinite odds list).
The relative performance vs the PQ-version is:
  primes produced:  100,000   300,000   1,000,000
  relative time:        0.6      0.75        0.97
One _crucial_ tidbit I've left out: _type_signature_.
Adding (:: [Int]) speeds this code up more than TWICE!
:) :)
If you are okay with Int, then maybe you're also happy with Int32 or Word32. If so, why don't you use Template Haskell to build the list at compile time? If you do that, then getting the kth prime at run-time is O(k). Take that, AKS! :)

Jason
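An added sketch of what Jason's suggestion could look like with Template Haskell's lift (the names are made up; the module split is forced by TH's stage restriction, which forbids splicing values defined in the module being compiled):

  -- PrimeTable.hs : the prime list lives in its own module
  module PrimeTable (somePrimes) where

  somePrimes :: [Int]
  somePrimes = 2 : sieve [3,5..]     -- any of this thread's definitions would do
    where sieve (p:xs) = p : sieve [x | x <- xs, x `rem` p /= 0]

  -- The consumer module then splices a prefix of it in at compile time:
  --
  --   {-# LANGUAGE TemplateHaskell #-}
  --   import Language.Haskell.TH.Syntax (lift)
  --   import PrimeTable (somePrimes)
  --
  --   bakedPrimes :: [Int]
  --   bakedPrimes = $(lift (take 10000 somePrimes))

The splice is evaluated when the consumer module is compiled, so the prefix of primes is baked into the binary as a literal list.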

Jason Dagit writes:
On Mon, Nov 2, 2009 at 1:59 PM, Will Ness wrote:

One _crucial_ tidbit I've left out: _type_signature_. Adding (:: [Int]) speeds this code up more than TWICE! :) :)
If you are okay with Int, then maybe you're also happy with Int32 or Word32. If so, why don't you use template haskell to build the list at compile time? If you do that, then getting the kth prime at run-time is O(k). Take that AKS! :)
O(k), I've removed it since the post actually. Wasn't thinking clearly for a moment, having seen the double speedup!

I've found Matthew Brecknell's fast code in the old Melissa O'Neill article discussion here, from 2007-02-19 18:14:23 GMT. Without the type signature, it's twice slower than my code.

I think it is a fairly faithful translation now of what the sieve is all about, except that it tests its candidate numbers whereas the sieve counts over them (and is thus able to skip over primes without testing them). The usual functional approach has it working with each number in isolation, so it tests each one (to recreate the counting state, in effect), and thus does much extra work on the primes. The next logical step is to start counting! :)

On Mon, Nov 2, 2009 at 2:40 PM, Will Ness wrote:
O(k), I've removed it since the post actually. Wasn't thinking clearly for a moment, having seen the double speedup!
By the way, do you understand where the speedup with Int is coming from? As I understand it, there are two main places. One is that the type class dictionary passing can be removed (GHC might figure this out already; I'd need to check the Core to be sure). The other is that GHC is likely unboxing to its primitive Int# type.

If none of that made sense, or you'd like to know how you can double-check what I'm talking about, Real World Haskell has a chapter with an example: http://book.realworldhaskell.org/read/profiling-and-optimization.html

Good luck!
Jason
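An added minimal illustration of the two effects described here (not code from the thread): the same definition once pinned to Int and once left to default to Integer; with optimisation, only the former lets GHC drop the Integral dictionary and unbox to Int#:

  -- pinned to machine Int: specialisable and unboxable
  primesInt :: [Int]
  primesInt = 2 : sieve [3,5..]
    where sieve (p:xs) = p : sieve [x | x <- xs, x `rem` p /= 0]

  -- no signature: the monomorphism restriction plus defaulting give Integer,
  -- i.e. boxed arbitrary-precision arithmetic throughout
  primesDefault = 2 : sieve [3,5..]
    where sieve (p:xs) = p : sieve [x | x <- xs, x `rem` p /= 0]

Comparing the generated Core with -ddump-simpl (or simply timing both) shows the difference.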

Jason Dagit writes:
[...] Good luck! Jason
Thanks! Writing the super-fast sieve wasn't my objective here, though. It rather was writing the fastest possible simple functional lazy code true to the sieve's definition, and understanding it better that way (that's the added bonus).

As it stands now, this code seems a rather faithful description of what the sieve _is_, except that it tests each number in isolation instead of counting over a bunch of them at once (skipping over primes, getting them for free). THAT's the crucial difference, which the article seems to be trying to explain but never quite puts in such simple terms.

All the extra activity is kept to an absolute minimum here, and _now_ the main thing can be dealt with further, if so desired - like turning to the PQ thing, etc. Then, if we were to compare them, it wouldn't be like comparing apples with orange juice. :) That's what it felt like, seeing the PQ code compared with the classic "naive" version in that article. I'm reasonably sure that, PQ-augmented, this code will be even faster, not slower, even for the first million primes.

This whole experience proves that the clearest code can also be the fastest (and maybe necessarily so). Seeing it described in that article as if clarity must be paid for with efficiency (and vice versa) didn't seem right to me.

Cheers,

On Mon, Nov 02, 2009 at 02:10:12PM -0700, Jason Dagit wrote:
If you are okay with Int, then maybe you're also happy with Int32 or Word32. If so, why don't you use template haskell to build the list at compile time? If you do that, then getting the kth prime at run-time is O(k). Take that AKS! :)
Silly you! If he's into TH, then he may just create an -- god forbid -- array and get the kth prime in O(1). :)

-- Felipe.
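An added sketch of Felipe's quip (hypothetical names): freeze a prefix of any of the prime lists above into an immutable array, making the kth prime an O(1) lookup:

  import Data.Array (Array, listArray, (!))

  -- an array of the first n primes, built from any list in this thread
  primeTable :: Int -> [Int] -> Array Int Int
  primeTable n ps = listArray (0, n - 1) (take n ps)

  -- usage:  let t = primeTable 100000 primes  in  t ! 9999   -- the 10,000th prime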
Participants (5):
- Bulat Ziganshin
- Felipe Lessa
- Jason Dagit
- Sjoerd Visscher
- Will Ness