Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve)

Here is another prime sieve. It is about half the length of the fastest contributed code (ONeillPrimes) and nearly as fast until it blows up on garbage collection:
% cat ONeillPrimes.hs | grep -v "^--" | wc 185 1105 6306 % cat BayerPrimes.hs | grep -v "^--" | wc 85 566 2418
[Integer] -O 1*10^6 2*10^6 3*10^6 4*10^6 5*10^6 --------------------------------------------------------- ONeillPrimes | 3.555 | 7.798 | 12.622 | 18.927 | 23.529 BayerPrimes | 3.999 | 8.895 | 18.003 | 22.977 | 38.053
I wrote this as an experiment in coding data structures in Haskell's lazy evaluation model, rather than as explicit data. The majority of the work done by this code is done by "merge"; the multiples of each prime percolate up through a tournament consisting of a balanced tree of suspended "merge" function calls. In order to create an infinite lazy balanced tree of lists, the datatype
data List a = A a (List a) | B [a]
is used as scaffolding. One thinks of the root of the infinite tree as starting at the leftmost child, and crawling up the left spine as necessary. What I'm calling a "venturi"
venturi :: Ord a => [[a]] -> [a]
merges an infinite list of infinite lists into one list, under the assumption that each list, and the heads of the lists, are in increasing order. This could be a generally useful function. If one can think of a better way to write "venturi", swapping in your code would in particular yield a faster prime sieve. I found that a tertiary merge tree was faster than a binary merge tree, because this leads to fewer suspensions. One can speed this code up a bit by interleaving strict and lazy calls, but I prefer to leave the code short and readable. It is bizarre that
trim p = let f m x = mod x m /= 0 in filter (f p)
lurks in the prime sieve code, but it is only used with primes of size up to the square root of the largest output prime. I tried more thoughtful alternatives, and they all slowed down the sieve. Sometimes dumb is beautiful. Thanks to apfelmus for various helpful remarks that lead me to think of this approach. Here is the code:
module BayerPrimes (venturi,primes) where
-- Code for venturi :: Ord a => [[a]] -> [a]
merge :: Ord a => [a] -> [a] -> [a] -> [a] merge xs@(x:xt) ys@(y:yt) zs@(z:zt) | x <= y = if x <= z then x : (merge xt ys zs) else z : (merge xs ys zt) | otherwise = if y <= z then y : (merge xs yt zs) else z : (merge xs ys zt) merge _ _ _ = undefined
data List a = A a (List a) | B [a]
mergeA :: Ord a => List a -> List a -> List a -> List a mergeA (A x xt) ys zs = A x (mergeA xt ys zs) mergeA (B xs) ys zs = mergeB xs ys zs
mergeB :: Ord a => [a] -> List a -> List a -> List a mergeB xs@(x:xt) ys@(A y yt) zs = case compare x y of LT -> A x (mergeB xt ys zs) EQ -> A x (mergeB xt yt zs) GT -> A y (mergeB xs yt zs) mergeB xs (B ys) zs = mergeC xs ys zs mergeB _ _ _ = undefined
mergeC :: Ord a => [a] -> [a] -> List a -> List a mergeC xs@(x:xt) ys@(y:yt) zs@(A z zt) | x < y = if x < z then A x (mergeC xt ys zs) else A z (mergeC xs ys zt) | otherwise = if y < z then A y (mergeC xs yt zs) else A z (mergeC xs ys zt) mergeC xs ys (B zs) = B $ merge xs ys zs mergeC _ _ _ = undefined
root :: Ord a => List a -> [List a] -> [a] root (A x xt) yss = x : (root xt yss) root (B xs) (ys:zs:yst) = root (mergeB xs ys zs) yst root _ _ = undefined
wrap :: [a] -> List a wrap [] = B [] wrap (x:xt) = A x $ B xt
triple :: Ord a => [List a] -> [List a] triple (x:y:z:xs) = mergeA x y z : (triple xs) triple _ = undefined
group :: Ord a => [List a] -> [List a] group (x:y:xt) = x : y : (group $ triple xt) group _ = undefined
venturi :: Ord a => [[a]] -> [a] venturi (x:xt) = root (wrap x) $ group $ map wrap xt venturi _ = undefined
-- Code for primes :: Integral a => [a]
diff :: Ord a => [a] -> [a] -> [a] diff xs@(x:xt) ys@(y:yt) = case compare x y of LT -> x : (diff xt ys) EQ -> (diff xt yt) GT -> (diff xs yt) diff _ _ = undefined
trim :: Integral a => a -> [a] -> [a] trim p = let f m x = mod x m /= 0 in filter (f p)
seed :: Integral a => [a] seed = [2,3,5,7,11,13,17]
wheel :: Integral a => [a] wheel = drop 1 [ m*j + k | j <- [0..], k <- ws ] where m = foldr1 (*) seed ws = foldr trim [1..m] seed
multiples :: Integral a => [a] -> [[a]] multiples ws = map fst $ tail $ iterate g ([], ws) where g (_,ps@(p:pt)) = ([ m*p | m <- ps ], trim p pt) g _ = undefined
primes :: Integral a => [a] primes = seed ++ (diff wheel $ venturi $ multiples wheel)

Dave Bayer wrote:
Here is another prime sieve.
It's great to know people are still having fun with this stuff... I've added your implementation to the zipfile available at http://www.cs.hmc.edu/~oneill/code/haskell-primes.zip (FWIW, I added specializations for Int and Integer and also primesToNth and primesToLimit).
It is about half the length of the fastest contributed code (ONeillPrimes)
I think you must be comparing it against an older version of my code (the zip file above now contains the most up to date version). That older version actually contained two slightly different copies of the algorithm. The more recent version doesn't. FWIW, here are the statistics I see for lines of code (ignoring comments, blank lines, and compiler directives): ONeillPrimes: 91 lines, 750 words, 4111 chars, 75628 bytes in .o file BayerPrimes: 72 lines, 604 words, 2649 chars, 74420 bytes in .o file So, I'd say the difference is at best 25% in source size and 2% in final binary size. But in reality, a big chunk of my code is a general purpose heap/ priority-queue implementation. If priority queue operations were provided as a standard part of Haskell in the same way that lists and BSTs are, the statistics would be: ONeillPrimes: 47 lines, 331 words, 2039 chars
and nearly as fast
Your results are great! It actually beats early versions of my method, before I made a couple of tweaks to improve performance. But for the current version of my code, there is still a bit of a performance gap between our two methods. Here are the stats I get (ghc -O3, 2.4GHz x86): 1*10^6 | 2*10^6 | 3*10^6 | 4*10^6 | 5*10^6 | 6*10^6 -------+--------+--------+--------+--------+--------- ONeillPrimes 1.36 | 3.08 | 4.98 | 6.98 | 9.05 | 11.21 ONeillPrimes* 1.35 | 3.07 | 4.94 | 6.93 | 8.99 | 11.14 BayerPrimes 2.18 | 4.49 | 8.99 | 11.18 | 16.60 | 25.77 The "*" version is one that uses ``calcPrimes()'' rather than ``primes'' to provide its list of primes, and thereby avoids needless remembering of the whole list of primes and all the memory that entails.
until it blows up on garbage collection:
I think that is the biggest issue with many of the other prime generators. At a glance (just looking at RSS with Unix's top command), your space usage seems like its about double mine. And when I use ``calcPrimes()'' rather than ``primes'' I barely need any space at all (O(sqrt(N)) at the point where I've calculated N primes. The difference there is striking -- a couple of MB vs hundreds. Anyway, fun stuff... Melissa.

Hi
But for the current version of my code, there is still a bit of a performance gap between our two methods. Here are the stats I get (ghc -O3, 2.4GHz x86):
Are you aware that -O3 is slower than -O2 and -O in ghc? If you want "fast code" then specify -O2, not -O3. Thanks Neil

Neil Mitchell wrote:
-O3 is slower than -O2 and -O in ghc? If you want "fast code" then specify -O2, not -O3.
Oops. That ought to have been -O2. But from what I can tell, -O3 is harmless (at least in this case). Both invocations generate identical executables, at least for these examples on my machine. Melissa.

Dave Bayer wrote:
What I'm calling a "venturi"
venturi :: Ord a => [[a]] -> [a]
merges an infinite list of infinite lists into one list, under the assumption that each list, and the heads of the lists, are in increasing order.
I wrote this as an experiment in coding data structures in Haskell's lazy evaluation model, rather than as explicit data. The majority of the work done by this code is done by "merge"; the multiples of each prime percolate up through a tournament consisting of a balanced tree of suspended "merge" function calls. In order to create an infinite lazy balanced tree of lists, the datatype
data List a = A a (List a) | B [a]
is used as scaffolding. One thinks of the root of the infinite tree as starting at the leftmost child, and crawling up the left spine as necessary.
After some pondering, the List a data structure for merging is really ingenious! :) Here's a try to explain how it works: The task is to merge a number of sorted and infinite lists when it's known that their heads are in increasing order. In particular, we want to write primes = (2:) $ diff [3..] $ venturi $ map multiple primes Thus, we have a (maybe infinite) list xss = [xs1, xs2, xs3, ...] of infinite lists with the following properties all sorted xss sorted (map head xss) where sorted is a function that returns True if the argument is a sorted list. A first try to implement the merging function is venturi xss = foldr1 merge xss = xs1 `merge` (xs2 `merge` (xs3 `merge` ... where merge is the standard function to merge to sorted lists. However, there are two problems. The first problem is that this doesn't work for infinite lists since merge is strict in both arguments. But the property head xs1 < head xs2 < head xs3 < ... we missed to exploit yet can now be used in the following way venturi xss = foldr1 merge' xss merge' (x:xt) ys = x : merge xt ys In other words, merge' is biased towards the left element merge' (x:_|_) _|_ = x : _|_ which is correct since we know that (head xs < head ys). The second problem is that we want the calls to merge to be arranged as a balanced binary tree since that gives an efficient heap. It's not so difficult to find a good shape for the infinite tree, the real problem is to adapt merge' to this situation since it's not associative: merge' xs (merge' (y:_|_) _|_) = merge' xs (y:_|_) = x1:x2:..:y:_|_ =/= merge' (merge' xs (y:_|_)) _|_ = merge' (x1:x2:...:y:_|_) _|_ = x1:_|_ The problem is that the second form doesn't realize that y is also smaller than the third argument. In other words, the second form has to treat more than one element as "privileged", namely x1,x2,... and y. This can be done with the aforementioned list data structure data People a = VIP a (People a) | Crowd [a] The people (VIPs and crowd) are assumed to be _sorted_. Now, we can start to implement merge' :: Ord a => People a -> People a -> People a The first case is merge' (VIP x xt) ys = VIP x (merge' xt ys) In other words, the invariant is that every VIP on the left of a merge' is guaranteed to be smaller than anyone on the right and thus will be served first. The next case merge' (Crowd xs) (Crowd ys) = Crowd (merge xs ys) is clear since it doesn't involve the invariant. What about the last case merge' (Crowd xs) (VIP y yt) = ?? Here, someone from the crowd xs may be smaller than y. But should we return a crowd or a VIP? The crucial answer is to always return a VIP merge' xs@(Crowd (x:xt)) ys@(VIP y yt) | x <= y = VIP x (merge' (Crowd xt) ys) | otherwise = VIP y (merge' xs yt) because doing otherwise would turn a VIP into a member of some crowd. But turning x into a VIP is no problem since that doesn't violated the invariant. Now merge' is associative and everything works as we want. I think that this approach has the potential to outperform O'Neill's (old?) code because it doesn't incorporates another prime number to the sieve mechanism until it really has to. I mean the following: in the 1st, 2nd, 3rd, 4th, ... step, O'Neill's code adds the multiples 2*3, 3*3, 5*5, 7*7, ... to the priority queue and uses them to sieve for potential prime numbers from then on. But the approach here adds 5*5=25 to the heap only after considering the 9th prime 23.
trim p = let f m x = mod x m /= 0 in filter (f p)
lurks in the prime sieve code, but it is only used with primes of size up to the square root of the largest output prime. I tried more thoughtful alternatives, and they all slowed down the sieve. Sometimes dumb is beautiful.
I still think that filtering the wheel when generating multiples is wrong. In fact, most of the algorithmic work would be done here if there wasn't the lucky coincidence that "it is only used with primes of size up to the square root of the largest output prime". You are saved to some extend by the fact that the cleverly constructed heap doesn't force the multiples of prime p until it starts considering candidate primes >= p*p. Also, large parts of the filtered lists have to be kept in memory since subsequent primes need to filter them further. Using a fresh throw-away wheel for every prime eliminates this memory eater.
until it blows up on garbage collection
Maybe the major reason for this is that many parts of shape of the ternary tree are forced too early. In other words, irrefutable pattern matches in triple , group and root should do the job. Also, using explicitly infinite lists is probably better. -- an explicitly infinite list data Stream a = a :> Stream a triples ~(x :> ~(y :> ~(z :> zs))) = mergeA x y z : triples zs group ~(x :> ~(y :> ys)) = x :> y :> group (triples ys) root (A x xt) yss = x :> (root xt yss) root (B xs) ~(ys :> ~(zs :> zss) = root (mergeB xs ys zs) zss I'm not sure whether it's really necessary to establish a perfectly balanced tree with root . Maybe it's better to have the smaller prime multiples near the top of the tree since they yield the minimum more often. In other words, I'd simply use venturi ~(xs :> ~(ys :> yss)) = mergeA xs ys (venturi $ triples yss)
Thanks to apfelmus for various helpful remarks that lead me to think of this approach. λ(^_^)λ
Regards, apfelmus

apfelmus wrote:
After some pondering, the List a data structure for merging is really ingenious! :) Here's a try to explain how it works:
Thanks apfelmus! A detailed explanation of this code is really helpful for anyone trying to understand what is going on. The VIP/ Crowd analogy is also very nice.
I think that this approach has the potential to outperform O'Neill's (old?) code because it doesn't incorporates another prime number to the sieve mechanism until it really has to. I mean the following: in the
1st, 2nd, 3rd, 4th, ... step,
O'Neill's code adds the multiples
2*2, 3*3, 5*5, 7*7, ...
to the priority queue and uses them to sieve for potential prime numbers from then on.
Yeah, that's (only) what the code in my paper does -- it's good enough for explicative purposes, but all recent versions have used a slightly augmented priority queue. It's a priority queue coupled with a "feeder list" that provides entries to add to the queue (in increasing order). They are only admitted to the heap data structure only once when the root of the heap "gets there". The two most important bits are: type HybridQ k v = (PriorityQ k v, [(k,v)]) -- postRemoveHQ is called when the min element of the heap has changed postRemoveHQ :: Ord k => HybridQ k v -> HybridQ k v postRemoveHQ mq@(pq, []) = mq postRemoveHQ mq@(pq, (qk,qv) : qs) | qk < minKeyPQ pq = (insertPQ qk qv pq, qs) | otherwise = mq (See ONeillPrimes.hs in http://www.cs.hmc.edu/~oneill/code/haskell- primes.zip for the complete code. I've also added Thorkil Naur's code from March, which is also a good performer, although its another case where most readers would find a detailed explanation of the code instructive.)
the approach here adds 5*5=25 to the heap only after considering the 9th prime 23.
Yep, that's what mine does too. Best Regards, Melissa.

Hello Melissa, On Tuesday 24 July 2007 19:09, Melissa O'Neill wrote:
apfelmus wrote:
After some pondering, the List a data structure for merging is really ingenious! :) Here's a try to explain how it works:
Thanks apfelmus! A detailed explanation of this code is really helpful for anyone trying to understand what is going on. The VIP/ Crowd analogy is also very nice.
I think that this approach has the potential to outperform O'Neill's (old?) code because it doesn't incorporates another prime number to the sieve mechanism until it really has to. I mean the following: in the
1st, 2nd, 3rd, 4th, ... step,
O'Neill's code adds the multiples
2*2, 3*3, 5*5, 7*7, ...
to the priority queue and uses them to sieve for potential prime numbers from then on.
Yeah, that's (only) what the code in my paper does -- it's good enough for explicative purposes, but all recent versions have used a slightly augmented priority queue. It's a priority queue coupled with a "feeder list" that provides entries to add to the queue (in increasing order). They are only admitted to the heap data structure only once when the root of the heap "gets there".
The two most important bits are:
type HybridQ k v = (PriorityQ k v, [(k,v)])
-- postRemoveHQ is called when the min element of the heap has changed postRemoveHQ :: Ord k => HybridQ k v -> HybridQ k v postRemoveHQ mq@(pq, []) = mq postRemoveHQ mq@(pq, (qk,qv) : qs) | qk < minKeyPQ pq = (insertPQ qk qv pq, qs) | otherwise = mq
(See ONeillPrimes.hs in http://www.cs.hmc.edu/~oneill/code/haskell- primes.zip for the complete code. I've also added Thorkil Naur's code from March, which is also a good performer,
Do you have detailed measurements that you wish to share? I would be most interested, I assure you.
although its another case where most readers would find a detailed explanation of the code instructive.)
I'll do my very best to provide such an explanation within, say, the next couple of weeks.
the approach here adds 5*5=25 to the heap only after considering the 9th prime 23.
Yep, that's what mine does too.
Best Regards,
Melissa.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Thanks a lot and the best regards Thorkil

Hello, On Wednesday 25 July 2007 01:42, Thorkil Naur wrote:
Hello Melissa,
On Tuesday 24 July 2007 19:09, Melissa O'Neill wrote:
... (See ONeillPrimes.hs in http://www.cs.hmc.edu/~oneill/code/haskell- primes.zip for the complete code. I've also added Thorkil Naur's code from March, which is also a good performer,
Do you have detailed measurements that you wish to share? I would be most interested, I assure you.
although its another case where most readers would find a detailed explanation of the code instructive.)
I'll do my very best to provide such an explanation within, say, the next couple of weeks. ...
And now that time has come, so brace yourselves. For your convenience, my "code from March" is thorkilnaur.dk/~tn/T64_20070303_1819.tar.gz See also a preliminary description in http://www.haskell.org/pipermail/haskell-cafe/2007-March/023095.html. The new explanation is here: thorkilnaur.dk/~tn/Haskell/EratoS/EratoS2.txt Best regards Thorkil

apfelmus
Dave Bayer wrote:
What I'm calling a "venturi"
venturi :: Ord a => [[a]] -> [a]
merges an infinite list of infinite lists into one list, under the assumption that each list, and the heads of the lists, are in increasing order.
I wrote this as an experiment in coding data structures in Haskell's lazy evaluation model, rather than as explicit data. The majority of the work done by this code is done by "merge"; the multiples of each prime percolate up through a tournament consisting of a balanced tree of suspended "merge" function calls. In order to create an infinite lazy balanced tree of lists, the datatype
data List a = A a (List a) | B [a]
is used as scaffolding. One thinks of the root of the infinite tree as starting at the leftmost child, and crawling up the left spine as necessary.
After some pondering, the List a data structure for merging is really ingenious! :) Here's a try to explain how it works:
The task is to merge a number of sorted and infinite lists when it's known that their heads are in increasing order. In particular, we want to write
primes = (2:) $ diff [3..] $ venturi $ map multiple primes
Thus, we have a (maybe infinite) list
xss = [xs1, xs2, xs3, ...]
of infinite lists with the following properties
all sorted xss sorted (map head xss)
where sorted is a function that returns True if the argument is a sorted list. A first try to implement the merging function is
venturi xss = foldr1 merge xss = xs1 `merge` (xs2 `merge` (xs3 `merge` ...
where merge is the standard function to merge to sorted lists.
However, there are two problems. The first problem is that this doesn't work for infinite lists since merge is strict in both arguments. But the property head xs1 < head xs2 < head xs3 < ... we missed to exploit yet can now be used in the following way
venturi xss = foldr1 merge' xss
merge' (x:xt) ys = x : merge xt ys
In other words, merge' is biased towards the left element
merge' (x:_|_) _|_ = x : _|_
which is correct since we know that (head xs < head ys).
The second problem is that we want the calls to merge to be arranged as a balanced binary tree since that gives an efficient heap. It's not so difficult to find a good shape for the infinite tree, the real problem is to adapt merge' to this situation since it's not associative:
......
The problem is that the second form doesn't realize that y is also smaller than the third argument. In other words, the second form has to treat more than one element as "privileged", namely x1,x2,... and y. This can be done with the aforementioned list data structure
data People a = VIP a (People a) | Crowd [a]
The people (VIPs and crowd) are assumed to be _sorted_. Now, we can start to implement
merge' :: Ord a => People a -> People a -> People a
Hi, ... replying to a two-years-old post here, :) :) and after consulting the full "VIP" version in haskellwiki/Prime_Numers#Implicit_Heap ... It is indeed the major problem with the merged multiples removing code (similar one to Richard Bird's code from Melissa O'Neill's JFP article) - the linear nature of foldr, requiring an (:: a->b->b) merge function. To make it freely composable to rearrange the list into arbitrary form tree it must indeed be type uniform (:: a->a->a) first, and associative second. The structure of the folded tree should be chosen to better suit the primes multiples production. I guestimate the total cost as Sum (1/p)*d, where p is a generating prime at the leaf, and d the leaf's depth, i.e. the amount of merge nodes its produced multiple must pass on its way to the top. The structure used in your VIP code, 1+(2+(4+(8+...))), can actually be improved upon with another, (2+4)+( (4+8)+( (8+16)+...)), for which the estimated cost is about 10%-12% lower. This can be expressed concisely as the following: primes :: () -> [Integer] primes () = 2:primes' where primes' = [3,5] ++ drop 2 [3,5..] `minus` comps mults = map (\p-> fromList [p*p,p*p+2*p..]) $ primes' (comps,_) = tfold mergeSP (pairwise mergeSP mults) fromList (x:xs) = ([x],xs) tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f (pairwise f xs) pairwise f (x:y:ys) = f x y : pairwise f ys mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c in (a ++ bc, merge b' d) where spMerge u@(x:xs) w@(y:ys) = case compare x y of LT -> (x:c,d) where (c,d) = spMerge xs w EQ -> (x:c,d) where (c,d) = spMerge xs ys GT -> (y:c,d) where (c,d) = spMerge u ys spMerge u [] = ([], u) spMerge [] w = ([], w) with ''merge'' and ''minus'' defined in the usual way. Its run times are indeed improved 10%-12% over the VIP code from the haskellwiki page. Testing was done by running the code, interpreted, inside GHCi. The ordered "split pairs" representing ordered lists here as pairs of a known, finite (so far) prefix and the rest of list, form a _monoid_ under mergeSP. Or with wheel, primes :: () -> [Integer] primes () = 2:3:5:7:primes' where primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes' (comps,_) = tfold mergeSP (pairwise mergeSP mults) fromList (x:xs) = ([x],xs) rollFrom n = let x = (n-11) `mod` 210 (y,_) = span (< x) wheelSums in roll n $ drop (length y) wheel wheelSums = roll 0 wdiffs roll = scanl (+) wheel = wdiffs ++ wheel wdiffs = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2: 4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wdiffs Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package, with about half used memory reported, in producing 10,000 to 300,000 primes. It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code.
participants (6)
-
apfelmus
-
Dave Bayer
-
Melissa O'Neill
-
Neil Mitchell
-
Thorkil Naur
-
Will Ness