newbie optimization question

For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000:

divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]

I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBook Pro, the obvious C program takes about .3 seconds, and a compiled OCaml program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds. I've tried using foldl', and doing explicit tail recursion with strict accumulators, but I can't get the running time below 3 seconds. Is it possible to come within striking distance of the other languages? Thanks. --PR
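For reference, the kind of tail-recursive, strict-accumulator version mentioned above might look like the following sketch (names and details are illustrative, not taken from the original post):

{-# LANGUAGE BangPatterns #-}

-- Sum of proper divisors via an explicit tail-recursive loop with a
-- strict accumulator.  The Int signature matters; without it the whole
-- computation defaults to Integer, as the rest of the thread shows.
sumDivisors :: Int -> Int
sumDivisors i = go 1 0
  where
    go !j !acc
      | j >= i          = acc
      | i `mod` j == 0  = go (j + 1) (acc + j)
      | otherwise       = go (j + 1) acc

main :: IO ()
main = print [i | i <- [1 .. 10000], i == sumDivisors i]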

On 10/28/07, Prabhakar Ragde
For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000.
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious C program takes about .3 seconds, and a compiled OCaML program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds.
You could try giving divisors a type signature: divisors :: Int -> [Int]
I've tried using foldl', and doing explicit tail recursion with strict accumulators, but I can't get the running time below 3 seconds. Is it possible to come within striking distance of the other languages? Thanks. --PR

Jaak Randmets wrote:
On 10/28/07, Prabhakar Ragde
wrote: For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000.
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious C program takes about .3 seconds, and a compiled OCaML program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds.
You could try giving divisors type signature: divisors :: Int -> [Int]
Thank you. That brings the time down to 0.5 seconds. I'm glad it was something as simple as that. --PR

Prabhakar Ragde wrote:
You could try giving divisors type signature: divisors :: Int -> [Int]
Thank you. That brings the time down to 0.5 seconds. I'm glad it was something as simple as that. --PR
I assume GHC was smart enough to do inlining and such in this case, so the difference is that it defaulted to Integer, whose operations are somewhat slower. Isaac

On Sun, 2007-10-28 at 10:23 -0400, Prabhakar Ragde wrote:
Jaak Randmets wrote:
On 10/28/07, Prabhakar Ragde
wrote: For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000.
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious C program takes about .3 seconds, and a compiled OCaML program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds.
You could try giving divisors type signature: divisors :: Int -> [Int]
Thank you. That brings the time down to 0.5 seconds. I'm glad it was something as simple as that. --PR
I'm not sure if Jaak Randmets explained this, but the reason this makes a difference is that Haskell defaults to Integer, an arbitrary-precision integer type, which is far more costly than an Int.
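Concretely: without a signature the comprehension gets an overloaded type, and the defaulting rules pick Integer at the use site. A minimal sketch of the point (the inferred type is easy to check with :t in GHCi):

-- Without the signature GHC infers
--     divisors :: Integral a => a -> [a]
-- and, since nothing in main pins the type down, the defaulting rules
-- pick Integer, so every mod, (+) and (==) goes through
-- arbitrary-precision arithmetic.  The one-line fix:
divisors :: Int -> [Int]
divisors i = [j | j <- [1..i-1], i `mod` j == 0]

main :: IO ()
main = print [i | i <- [1..10000], i == sum (divisors i)]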

Prabhakar Ragde writes:
For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000.
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious C program takes about .3 seconds, and a compiled OCaML program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds.
I've tried using foldl', and doing explicit tail recursion with strict accumulators, but I can't get the running time below 3 seconds. Is it possible to come within striking distance of the other languages? Thanks.
Just a trivial comment...

1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures.
2. Please, DO code the above in C, using linked lists. Compare then.
3. Check the influence of bare printing, separated from the computation.

Jerzy Karczmarczuk

Jerzy Karczmarczuk wrote:
Just a trivial comment...
1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures. 2. Please, DO code the above in C, using linked lists. Compare then.
3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there is no I/O? I take your points, though I am trying to do something admittedly vague, which is to compare fairly naive yet reasonably idiomatic programs that might be written by a beginning undergraduate. This particular task is already biased towards C, so it is reassuring that code that could be written less than fifty pages into Hutton is competitive with code that could be written -- hmmm, sixty pages into K&R, and a hundred pages into the C text I plan to use. --PR

On Sun, Oct 28, 2007 at 11:26:46AM -0400, Prabhakar Ragde wrote:
Jerzy Karczmarczuk wrote:
Just a trivial comment... 1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures. 2. Please, DO code the above in C, using linked lists. Compare then. 3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there is no I/O?
Yes, but GHC is not clever enough to solve the perfect number classification problem. 'length' will suffice, and is preferred for most enumeration benchmarks. Note, however, that there are a grand total of 4 perfect numbers less than 10,000. Not exactly an IO problem. Stefan
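A small sketch of that suggestion: forcing the result with length keeps the computation from being discarded, while the cost of output stays negligible.

-- The predicate must still be evaluated for every i to know the length,
-- so nothing is skipped, and printing drops out of the measurement.
divisors :: Int -> [Int]
divisors i = [j | j <- [1..i-1], i `mod` j == 0]

main :: IO ()
main = print (length [i | i <- [1..10000], i == sum (divisors i)])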

Stefan O'Rear adds to the dialogue:
Prabhakar Ragde wrote:
Jerzy Karczmarczuk wrote:
Just a trivial comment... 1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures. 2. Please, DO code the above in C, using linked lists. Compare then. 3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there is no I/O?
Yes, but GHC is not clever enough to solve the perfect number classification problem. 'length' will suffice, and is preferred for most enumeration benchmarks.
My point didn't concern that. A Haskell compiler cannot change an algorithm using lists into something which deals with indexable arrays, which are usually faster. Indexing may be faster than the indirection, and the allocation of memory costs. And there is laziness... That's why I proposed to check what happens if one uses linked lists in "C". Well, the follow-ups seem to suggest that the main time eater was the overloading. I must say that I am really astonished. It is hard to believe that such a signature can make a factor of 8. Never seen that before. Jerzy Karczmarczuk

jerzy.karczmarczuk:
Stefan O'Rear adds to the dialogue:
Prabhakar Ragde wrote:
Jerzy Karczmarczuk wrote:
Just a trivial comment... 1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures. 2. Please, DO code the above in C, using linked lists. Compare then. 3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there is no I/O?
Yes, but GHC is not clever enough to solve the perfect number classification problem. 'length' will suffice, and is preferred for most enumeration benchmarks.
My point didn't concern that. A Haskell compiler cannot change an algorithm using lists into something which deals with indexable arrays, which are usually faster. Indexing may be faster than the indirection, and the allocation of memory costs. And there is laziness... That's why I proposed to check what happens if one uses linked lists in "C". Well, the follow-ups seem to suggest that the main time eater was the overloading. I must say that I am really astonished. It is hard to believe that such a signature can make a factor of 8. Never seen that before.
That fits with my experience writing low level numeric code -- Integer can be a killer. -- Don

On Sun, 2007-10-28 at 12:01 -0700, Don Stewart wrote:
jerzy.karczmarczuk:
Stefan O'Rear adds to the dialogue:
Prabhakar Ragde wrote:
Jerzy Karczmarczuk wrote:
Just a trivial comment... 1. Don't speak about comparing *languages* when you compare *algorithms*, and in particular data structures. 2. Please, DO code the above in C, using linked lists. Compare then. 3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there is no I/O?
Yes, but GHC is not clever enough to solve the perfect number classification problem. 'length' will suffice, and is preferred for most enumeration benchmarks.
My point didn't concern that. A Haskell compiler cannot change an algorithm using lists into something which deals with indexable arrays, which are usually faster. Indexing may be faster than the indirection, and the allocation of memory costs. And there is laziness... That's why I proposed to check what happens if one uses linked lists in "C". Well, the follow-ups seem to suggest that the main time eater was the overloading. I must say that I am really astonished. It is hard to believe that such a signature can make a factor of 8. Never seen that before.
That fits with my experience writing low level numeric code -- Integer can be a killer.
Inline machine operations v. out-of-line calls to an arbitrary precision integer C library: there shouldn't be any surprise here. I could make it even slower by making a Nat type and giving it the type Nat -> [Nat]. Also, this is a well known issue. It is common for people to, for example, write naive fib and then say that Haskell is useless because it's orders of magnitude slower than naive fib in C or whatever. Then you tell them to add an Int -> Int type signature and all of a sudden Haskell is beating C soundly.

Am Sonntag, 28. Oktober 2007 20:09 schrieb Derek Elkins: <snip>
That fits with my experience writing low level numeric code -- Integer can be a killer.
Inline machine operations v. out-of-line calls to an arbitrary precision integer C library: there shouldn't be any surprise here.
Obviously. However, what if 32-bit integers aren't sufficient? What perpetually puzzles me is that in C, long long int has very good performance, *much* faster than gmp, while in Haskell, on my computer, Int64 is hardly faster than Integer. So I stick to Integer mostly; it may be slow, but it's correct. Cheers, Daniel
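For reference, the Int64-specialised variant being compared here is just a type change; whether plain Int is already wide enough depends on the platform's word size, which maxBound shows. A sketch:

import Data.Int (Int64)

-- Same program specialised to Int64.  On a 64-bit GHC, plain Int is
-- already a 64-bit type, which can be checked with
--     print (maxBound :: Int)
divisors :: Int64 -> [Int64]
divisors i = [j | j <- [1..i-1], i `mod` j == 0]

main :: IO ()
main = print [i | i <- [1..10000], i == sum (divisors i)]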

On Sun, Oct 28, 2007 at 08:40:28PM +0100, Daniel Fischer wrote:
Am Sonntag, 28. Oktober 2007 20:09 schrieb Derek Elkins: <snip>
That fits with my experience writing low level numeric code -- Integer can be a killer.
Inline machine operations v. out-of-line calls to an arbitrary precision integer C library: there shouldn't be any surprise here.
Obviously. However, what if 32-bit integers aren't sufficient? What perpetually puzzles me is that in C, long long int has very good performance, *much* faster than gmp, while in Haskell, on my computer, Int64 is hardly faster than Integer. So I stick to Integer mostly; it may be slow, but it's correct.
Int64 in Glasgow Haskell is not implemented for speed - it uses the FFI to call a number of addition/subtraction/whatever primitives in the runtime. If this is a problem for you, file a feature request on the GHC bugtracker. Stefan

Daniel Fischer wrote:
What perpetually puzzles me is that in C long long int has very good performance, *much* faster than gmp, in Haskell, on my computer, Int64 is hardly faster than Integer.
I tried the example with Int64 and Integer. The Integer version was actually quicker ... which is the reason I decided to post the results.

C++ version times: 1.125; 1.109; 1.125
Int32 cpu times: 3.203; 3.172; 3.172
Int64 cpu times: 11.734; 11.797; 11.844
Integer cpu times: 9.609; 9.609; 9.500

Interesting that Int64 is *slower* than Integer. On the other side the C version is not that much quicker. I guess the Haskell version is using generic versions of mod and sum (since they are from a library), which would mean indirect calls. The Haskell version probably also creates the list nodes ... even when they get almost immediately garbage collected. Thanks for pointing out Int64 sucks so much :) Peter.

peter:
Daniel Fischer wrote:
What perpetually puzzles me is that in C long long int has very good performance, *much* faster than gmp, in Haskell, on my computer, Int64 is hardly faster than Integer.
I tried the example with Int64 and Integer. The integer version was actually quicker ... which is the reason I decided to post the results.
C++ version times: 1.125; 1.109; 1.125 Int32 cpu times: 3.203; 3.172; 3.172 Int64 cpu times: 11.734; 11.797; 11.844 Integer cpu times: 9.609; 9.609; 9.500
Interesting that Int64 is *slower* than Integer.
On the other side the C version is not that much quicker. I guess the Haskell version is using generic versions of mod and sum (since they are from a library) which would mean indirect calls. The Haskell version probably also creates the list nodes ... even when they get almost immediately garbage collected.
Thanks for pointing out Int64 sucks so much :)
With -O2, ghc will only have a function call to 'sum'; with -O2 and the stream fusion list library, there are no indirect calls and the entire program is specialised to a nested loop. Do you have your C++ program handy? I'd expect the fully fused version to be somewhere between 1 and 2x slower. -- Don

Peter Hercek wrote:
C++ version times: 1.125; 1.109; 1.125 Int32 cpu times: 3.203; 3.172; 3.172 Int64 cpu times: 11.734; 11.797; 11.844 Integer cpu times: 9.609; 9.609; 9.500
Oops, my results were wrong (non-optimizing MS cl compiler used, and I used -O instead of -O2 in ghc).

C++ version times: 1.109; 1.125; 1.125
Int32 cpu times: 1.359; 1.359; 1.375
Int64 cpu times: 11.688; 11.719; 11.766
Integer cpu times: 9.719; 9.703; 9.703

Great result from ghc. Peter.

peter:
Peter Hercek wrote:
C++ version times: 1.125; 1.109; 1.125 Int32 cpu times: 3.203; 3.172; 3.172 Int64 cpu times: 11.734; 11.797; 11.844 Integer cpu times: 9.609; 9.609; 9.500
Ooops, my results ware wrong (nonoptimizing ms cl compiler used and I used -O instead of -O2 in ghc).
C++ version times: 1.109; 1.125; 1.125 Int32 cpu times: 1.359; 1.359; 1.375 Int64 cpu times: 11.688; 11.719; 11.766 Integer cpu times: 9.719; 9.703; 9.703
Great result from ghc.
What Haskell program were you using for this test? The original naive/high level implementation? -- Don

Don Stewart wrote:
C++ version times: 1.109; 1.125; 1.125 Int32 cpu times: 1.359; 1.359; 1.375 Int64 cpu times: 11.688; 11.719; 11.766 Integer cpu times: 9.719; 9.703; 9.703
Great result from ghc.
What Haskell program were you using for this test? The original naive/high level implementation?
-- Don
Here it is (I played only with the types for divisors and perfect; bottom is there from my previous tests, but it should not matter):

<---cut--->
module Main (bottom, divisors, perfect, main) where
import Data.Int

bottom = error "_|_"

divisors :: Int -> [Int]
divisors i = [j | j<-[1..i-1], i `mod` j == 0]

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]

main = print perfect
<---cut--->

and here is the C++ version:

<---cut--->
#include <iostream>
using namespace std;

int main() {
  for (int i = 1; i <= 10000; i++) {
    int sum = 0;
    for (int j = 1; j < i; j++)
      if (i % j == 0) sum += j;
    if (sum == i) cout << i << " ";
  }
  return 0;
}
<---cut--->

OS: winxp 64 bit
ghc v. 6.6.1 (options -O2)
MS cl.exe version 13.10.3077 (options /G7 /MD)

Peter.

Peter Hercek wrote:
MS cl.exe version 13.10.3077 (options /G7 /MD)
And I had the cl options wrong too - I need to actually turn on optimization, not only set the optimization target.

/G7 /MD -> 1.109; 1.125; 1.125
/Ox /G7 /MD -> 0.922; 0.984; 0.984

Still it does not change the results too much. Peter.

On Sun, 2007-10-28 at 23:34 +0100, Peter Hercek wrote:
Don Stewart wrote:
C++ version times: 1.109; 1.125; 1.125 Int32 cpu times: 1.359; 1.359; 1.375 Int64 cpu times: 11.688; 11.719; 11.766 Integer cpu times: 9.719; 9.703; 9.703
Great result from ghc.
What Haskell program were you using for this test? The original naive/high level implementation?
-- Don
Here it is (I played only with the types for divisors and perfect; bottom is there from my previous tests, but it should not matter): <---cut--->
module Main (bottom, divisors, perfect, main) where import Data.Int
bottom = error "_|_"
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `mod` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
Try with rem instead of mod. (What the heck is with bottom?)

Derek Elkins wrote:
Try with rem instead of mod.
(What the heck is with bottom?)
The bottom was there by error and I was too lazy to redo the tests, so I rather posted exactly what I was doing. I do not know the compiler well enough to be absolutely sure it cannot have an impact on the result ... so I rather did not doctor what I did :-)

Ok, rem did help quite a bit. Any comments why it is so?

Here is a summary of the results for those interested. I ran all the tests once again. Haskell was about 13% slower than C++.

MS cl.exe options: /Ox /G7 /MD
ghc options: -O2

C++ version: 1.000; 0.984; 0.984
Haskell version specialized to Int: 1.125; 1.125; 1.109
Haskell version specialized to Integer: 8.781; 8.813; 8.813
Haskell version specialized to Int64: 9.781; 9.766; 9.831

The code:

% cat b.hs
module Main (divisors, perfect, main) where
import Data.Int

divisors :: Int -> [Int]
divisors i = [j | j<-[1..i-1], i `rem` j == 0]

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]

main = print perfect

% cat b.cpp
#include <iostream>
using namespace std;

int main() {
  for (int i = 1; i <= 10000; i++) {
    int sum = 0;
    for (int j = 1; j < i; j++)
      if (i % j == 0) sum += j;
    if (sum == i) cout << i << " ";
  }
  return 0;
}
%

rem is faster because it has slightly different behaviour from mod, and
there happens to be an Intel instruction that maps more directly to
rem than to mod, thus making it much faster on Intel processors.
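To spell out the semantic difference (and why the swap is safe in this benchmark, where both operands are always positive): mod and rem agree on non-negative operands but differ in how they handle signs, and mod needs an extra adjustment on top of what the machine division produces. A small check:

-- rem takes the sign of the dividend (what x86 idiv produces directly);
-- mod takes the sign of the divisor, so it needs a correction step.
main :: IO ()
main = do
  print ((-7) `mod` 3, (-7) `rem` 3)    -- (2,-1)
  print (7 `mod` (-3), 7 `rem` (-3))    -- (-2,1)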
Why do you expose perfect and divisors? Maybe if you just expose main,
perfect and divisors will be inlined (although this will only save
10,000 function entries, so will probably have negligible effect).
Rodrigo
On 29/10/2007, Peter Hercek
Derek Elkins wrote:
Try with rem instead of mod.
(What the heck is with bottom?)
The bottom was there by error and I was lazy to redo the tests so I rather posted exactly what I was doing. I do not know the compiler that good to be absolutely sure it cannot have impact on result ... so I rather did not doctor what I did :-)
Ok, rem did help quite a bit. Any comments why it is so?
Here are summary of results for those interested. I run all the tests once again. Haskell was about 13% slower than C++.
MS cl.exe options: /Ox /G7 /MD ghc options: -O2
C++ version: 1.000; 0.984; 0.984 Haskell version specialized to Int: 1.125; 1.125; 1.109 Haskell version specialized to Integer: 8.781; 8.813; 8.813 Haskell version specialized to Int64: 9.781; 9.766; 9.831
The code:
% cat b.hs module Main (divisors, perfect, main) where import Data.Int
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `rem` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
% cat b.cpp #include <iostream> using namespace std;
int main() { for (int i = 1; i <= 10000; i++) { int sum = 0; for (int j = 1; j < i; j++) if (i % j == 0) sum += j; if (sum == i) cout << i << " "; } return 0; }
%

Rodrigo Queiro wrote:
Why do you expose perfect and divisors? Maybe if you just expose main, perfect and divisors will be inlined (although this will only save 10,000 function entries, so will probably have negligible effect).
I exposed them so that I can check types in ghci. Hiding them does not seem to have noticeable effect. Thanks, Peter.

Hello all,

just to compare the stuff, I get quite different results on another OS. Thus, the result of the C++ compiler may not be that interesting, as I do not have the one presented below.

My machine:
Linux 2.6.23-ARCH #1 SMP PREEMPT Mon Oct 22 12:50:26 CEST 2007 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

Compilers:
g++ --version
g++ (GCC) 4.2.2
Copyright (C) 2007 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.6.1

Measurement:
compiled with ghc -O2: time ./mainInteger
real 0m4.866s user 0m4.843s sys 0m0.020s

compiled with ghc -O2: time ./mainInt64
real 0m2.213s user 0m2.210s sys 0m0.003s

compiled with ghc -O2: time ./mainInt
real 0m1.149s user 0m1.143s sys 0m0.003s

compiled with g++ -O3: time ./mainC
real 0m0.271s user 0m0.270s sys 0m0.000s

I don't know what the reason is, but the difference between Int, Int64 and Integer is not as dramatic as in the example below; nevertheless, the difference between GHC and GNU C++ is very bad.... :-\

Dusan

Peter Hercek wrote:
Derek Elkins wrote:
Try with rem instead of mod.
(What the heck is with bottom?)
The bottom was there by error and I was lazy to redo the tests so I rather posted exactly what I was doing. I do not know the compiler that good to be absolutely sure it cannot have impact on result ... so I rather did not doctor what I did :-)
Ok, rem did help quite a bit. Any comments why it is so?
Here are summary of results for those interested. I run all the tests once again. Haskell was about 13% slower than C++.
MS cl.exe options: /Ox /G7 /MD ghc options: -O2
C++ version: 1.000; 0.984; 0.984 Haskell version specialized to Int: 1.125; 1.125; 1.109 Haskell version specialized to Integer: 8.781; 8.813; 8.813 Haskell version specialized to Int64: 9.781; 9.766; 9.831
The code:
% cat b.hs module Main (divisors, perfect, main) where import Data.Int
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `rem` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
% cat b.cpp #include <iostream> using namespace std;
int main() { for (int i = 1; i <= 10000; i++) { int sum = 0; for (int j = 1; j < i; j++) if (i % j == 0) sum += j; if (sum == i) cout << i << " "; } return 0; }
%
--
Dusan Kolar            tel: +420 54 114 1238
UIFS FIT VUT Brno      fax: +420 54 114 1270
Bozetechova 2          e-mail: kolar@fit.vutbr.cz
Brno 612 66
Czech Republic
--

Am Montag, 29. Oktober 2007 13:49 schrieb Dusan Kolar:
Hello all,
just to compare the stuff, I get quite other results being on other OS. Thus, the result of C++ compiler may not be that interesting as I do not have the one presented below.
Just to chime in, my results with the code below:

dafis@linux:~> uname -a
Linux linux 2.4.20-4GB-athlon #1 Mon Mar 17 17:56:47 UTC 2003 i686 unknown unknown GNU/Linux

on a 1200 MHz Duron; g++ is version 3.3, C++ code compiled with -O3, Haskell with -O2 (GHC 6.6.1)

dafis@linux:~> time ./mainC
6 28 496 8128
real 0m1.945s user 0m1.910s sys 0m0.010s

dafis@linux:~> time ./mainInt
[6,28,496,8128]
real 0m2.407s user 0m2.300s sys 0m0.010s

dafis@linux:~> time ./mainInt64
[6,28,496,8128]
real 0m24.009s user 0m23.900s sys 0m0.050s

dafis@linux:~> time ./mainInteger
[6,28,496,8128]
real 0m21.555s user 0m20.870s sys 0m0.010s

So Int is not so much slower than C; Int64 and Integer are dramatically slower, with Integer beating Int64 here, too.

Cheers, Daniel
My machine: Linux 2.6.23-ARCH #1 SMP PREEMPT Mon Oct 22 12:50:26 CEST 2007 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
Compilers: g++ --version g++ (GCC) 4.2.2 Copyright (C) 2007 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ghc --version The Glorious Glasgow Haskell Compilation System, version 6.6.1
Measurement: compiled with ghc -O2 time ./mainInteger real 0m4.866s user 0m4.843s sys 0m0.020s
compiled with ghc -O2 time ./mainInt64 real 0m2.213s user 0m2.210s sys 0m0.003s
compiled with ghc -O2 time ./mainInt real 0m1.149s user 0m1.143s sys 0m0.003s
compiled with g++ -O3 time ./mainC real 0m0.271s user 0m0.270s sys 0m0.000s
I don't know what is the reason, but the difference between Int, Int64 and Integer is not that dramatic as in example below, nevertheless, the difference between GHC and GNU C++ is very bad.... :-\
Dusan
The code:
% cat b.hs module Main (divisors, perfect, main) where import Data.Int
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `rem` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
% cat b.cpp #include <iostream> using namespace std;
int main() { for (int i = 1; i <= 10000; i++) { int sum = 0; for (int j = 1; j < i; j++) if (i % j == 0) sum += j; if (sum == i) cout << i << " "; } return 0; }
%

OK, if somebody wants to speculate (and if it even makes sense for such a microbenchmark), here are some more data. Besides the different OS and C++ compiler, the processor is different too. On my side it was an AMD Athlon 64 X2 4800+ (2.4GHz, 2x1MiB L2 cache, non-shared; C&Q was not switched off during the test). The system has 2GiB RAM. The C++ version had a working set of about 1.7 MiB; the ghc version's was about 2 times bigger. Peter.

Dusan Kolar wrote:
Hello all,
just to compare the stuff, I get quite other results being on other OS. Thus, the result of C++ compiler may not be that interesting as I do not have the one presented below.
My machine: Linux 2.6.23-ARCH #1 SMP PREEMPT Mon Oct 22 12:50:26 CEST 2007 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
Compilers: g++ --version g++ (GCC) 4.2.2 Copyright (C) 2007 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ghc --version The Glorious Glasgow Haskell Compilation System, version 6.6.1
Measurement: compiled with ghc -O2 time ./mainInteger real 0m4.866s user 0m4.843s sys 0m0.020s
compiled with ghc -O2 time ./mainInt64 real 0m2.213s user 0m2.210s sys 0m0.003s
compiled with ghc -O2 time ./mainInt real 0m1.149s user 0m1.143s sys 0m0.003s
compiled with g++ -O3 time ./mainC real 0m0.271s user 0m0.270s sys 0m0.000s
I don't know what is the reason, but the difference between Int, Int64 and Integer is not that dramatic as in example below, nevertheless, the difference between GHC and GNU C++ is very bad.... :-\
Dusan
Peter Hercek wrote:
Derek Elkins wrote:
Try with rem instead of mod.
(What the heck is with bottom?)
The bottom was there by error and I was lazy to redo the tests so I rather posted exactly what I was doing. I do not know the compiler that good to be absolutely sure it cannot have impact on result ... so I rather did not doctor what I did :-)
Ok, rem did help quite a bit. Any comments why it is so?
Here are summary of results for those interested. I run all the tests once again. Haskell was about 13% slower than C++.
MS cl.exe options: /Ox /G7 /MD ghc options: -O2
C++ version: 1.000; 0.984; 0.984 Haskell version specialized to Int: 1.125; 1.125; 1.109 Haskell version specialized to Integer: 8.781; 8.813; 8.813 Haskell version specialized to Int64: 9.781; 9.766; 9.831
The code:
% cat b.hs module Main (divisors, perfect, main) where import Data.Int
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `rem` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
% cat b.cpp #include <iostream> using namespace std;
int main() { for (int i = 1; i <= 10000; i++) { int sum = 0; for (int j = 1; j < i; j++) if (i % j == 0) sum += j; if (sum == i) cout << i << " "; } return 0; }
%

peter:
Don Stewart wrote:
C++ version times: 1.109; 1.125; 1.125 Int32 cpu times: 1.359; 1.359; 1.375 Int64 cpu times: 11.688; 11.719; 11.766 Integer cpu times: 9.719; 9.703; 9.703
Great result from ghc.
What Haskell program were you using for this test? The original naive/high level implementation?
-- Don
Here it is (I played only with the types for divisors and perfect; bottom is there from my previous tests, but it should not matter): <---cut--->
module Main (bottom, divisors, perfect, main) where import Data.Int
bottom = error "_|_"
divisors :: Int -> [Int] divisors i = [j | j<-[1..i-1], i `mod` j == 0]
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
This should be a little faster, as sum will fuse:

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum' (divisors i)]
  where sum' = foldr (+) 0

Don Stewart wrote:
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum (divisors i)]
This should be a little faster , as sum will fuse,
perfect :: [Int] perfect = [i | i<-[1..10000], i == sum' (divisors i)] where sum' = foldr (+) 0
sum' did not help. Times are about the same with Int type. Peter.
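For completeness, the strict left fold mentioned at the very start of the thread would be written like the following sketch; it is shown only as the other obvious variant to compare, not as something measured here.

import Data.List (foldl')

-- Strict left-fold version of the sum; the rest is the same
-- Int-specialised code as above.
divisors :: Int -> [Int]
divisors i = [j | j <- [1..i-1], i `rem` j == 0]

perfect :: [Int]
perfect = [i | i <- [1..10000], i == foldl' (+) 0 (divisors i)]

main :: IO ()
main = print perfect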

Peter Hercek wrote:
Daniel Fischer wrote:
What perpetually puzzles me is that in C long long int has very good performance, *much* faster than gmp, in Haskell, on my computer, Int64 is hardly faster than Integer.
I tried the example with Int64 and Integer. The integer version was actually quicker ... which is the reason I decided to post the results.
C++ version times: 1.125; 1.109; 1.125 Int32 cpu times: 3.203; 3.172; 3.172 Int64 cpu times: 11.734; 11.797; 11.844 Integer cpu times: 9.609; 9.609; 9.500
Interesting that Int64 is *slower* than Integer.
I can believe that. Integer is actually optimised for small values: there's a specialised representation for values that fit in a single word that avoids calling out to the GMP library. As Stefan pointed out, there's a lot of room to improve the performance of Int64; it's just never been a high priority. Cheers, Simon
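The two-constructor representation being referred to looks roughly like this (a from-memory sketch of GHC's Integer at the time, not the actual library source; the type is renamed to avoid clashing with the real one):

{-# LANGUAGE MagicHash #-}
module IntegerSketch where

import GHC.Exts (Int#, ByteArray#)

-- Sketch of the idea only: values that fit in a machine word stay in
-- the S# constructor and never touch GMP; only larger values carry a
-- GMP-backed ByteArray# of limbs.
data IntegerSketch
  = S# Int#              -- "small" Integer, one machine word
  | J# Int# ByteArray#   -- sign/size word plus GMP limb array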

Hi, Prabhakar Ragde wrote:
divisors i = [j | j<-[1..i-1], i `mod` j == 0] main = print [i | i<-[1..10000], i == sum (divisors i)]
Jerzy Karczmarczuk wrote:
My point didn't concern that point. Haskell compiler cannot change an algorithm using lists into something which deals with indexable arrays, usually faster. Indexing may be faster than the indirection, and the allocation of memory costs. And there is laziness...
This may be true, but it isn't relevant in this case, since the "obvious C program" doesn't need any arrays, only two loops:

for (int i = 1; i <= 10000; i++) {
  int sum = 0;
  for (int j = 1; j < i; j++)
    if (i % j == 0) sum += j;
  if (sum == i) print(i);
}

Loops can be expressed with lazy lists in Haskell. Therefore, the presented Haskell program is perfectly equivalent to the "obvious C program". Tillmann Rendel

rendel:
Prabhakar Ragde wrote:
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
Jerzy Karczmarczuk wrote:
My point didn't concern that point. Haskell compiler cannot change an algorithm using lists into something which deals with indexable arrays, usually faster. Indexing may be faster than the indirection, and the allocation of memory costs. And there is laziness...
This may be true, but it isn't relevant in this case, since the "obvious c program" doesn't need any arrays, only two loops:
for (int i = 1; i <= 10000; i++) {
  int sum = 0;
  for (int j = 1; j < i; j++)
    if (i % j == 0) sum += j;
  if (sum == i) print(i);
}
Loops can be expressed with lazy lists in Haskell. Therefore, the presented Haskell program is perfectly equivalent to the "obvious c program".
So what we would hope is that GHC could transform a set of composed lazy list functions into a doubly nested strict loop in Int#... Let's see if we can get that result from GHC, using a bit of fusion.

First, to explain what is happening, let's replace the `mod` call with `rem`, which is faster, and then desugar the list comprehension and enumeration syntax, to expose the underlying program:

default(Int)

divisors i = filter (\j -> i `rem` j == 0) (enumFromTo 1 (i-1))

main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

Looking at this we see some good chances for fusion to take place: the enumFromTo should fuse twice with 'filter', using build/foldr fusion. And with stream fusion, the left fold 'sum' should also fuse with the pipeline that results from divisors. So my prediction would be that this program would run slightly faster with stream fusion. Let's see...

Compiling with -O2 and a ghc 6.8 snapshot, with build/foldr fusion, we see two fusion sites, as expected, and a specialisation of 'sum':

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats
RuleFired
    1 SPEC Data.List.sum
    2 fold/build

Good, running this:

$ time ./A-stream
[6,28,496,8128]
./A-stream  1.29s user 0.02s system 99% cpu 1.319 total

Now we can try with stream fusion, using the stream-fusible list library here: http://www.cse.unsw.edu.au/~dons/code/streams/list/

To use these list functions in preference to the default, we have to import:

import Prelude hiding (filter,sum,enumFromTo)
import Data.List.Stream

and since the base library doesn't include a stream-fusible enumFromTo, we'll have to write our own definition in terms of stream:

import Data.Stream (enumFromToInt,unstream)

enumFromTo i j = unstream (enumFromToInt i j)

Ok, that's easy. Compiling this, we hope to see 3 fusion sites, and all heap-allocated Ints removed:

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats -package list
RuleFired
    2 filter -> fusible
    1 sumInt -> fusible
    1 sum spec Int
    3 STREAM stream/unstream fusion

Terrific! The 'sum' was specialised to Int, then translated to a stream version, the two filters also were translated, then 3 fusion sites were found and fused. Our program should now run faster:

$ time ./A-stream
[6,28,496,8128]
./A-stream  1.23s user 0.01s system 99% cpu 1.251 total

And so it does, with no list allocated for the sum loop. In fact the entire program reduces to a strict unboxed nested loop:

unfold :: Int# -> [Int]
wloop_sum_sV5 :: Int# -> Int# -> Int#

So almost identical types to the C program (bar the returned [Int]).

Finally, we can manually translate the C code into a confusing set of nested loops with interleaved IO:

main = loop 1
  where
    loop !i | i > 10000 = return ()
            | otherwise = if i == go i 0 1 then print i >> loop (i+1)
                                           else loop (i+1)

    go !i !s !j | j <= i-1  = if i `rem` j == 0 then go i (s+j) (j+1)
                                                else go i s (j+1)
                | otherwise = s

And we get *no speed benefit* at all!

time ./A-loop
1.24s user 0.00s system 98% cpu 1.256 total

So the lesson is: write in a high level style, and the compiler can do the work for you. Or, GHC is pretty smart on high level code.

-- Don

On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
Finally, we can manually translate the C code into a confusing set of nested loops with interleaved IO,
main = loop 1 where loop !i | i > 10000 = return () | otherwise = if i == go i 0 1 then print i >> loop (i+1) else loop (i+1)
go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1) else go i s (j+1) | otherwise = s
And we get *no speed benefit* at all!
time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
So the lesson is: write in a high level style, and the compiler can do the work for you. Or, GHC is pretty smart on high level code.
IO blocks unboxing in GHC. How fast is your mock-C code refactored to do IO outside of the loops only? Stefan
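The refactoring being asked about would presumably look something like this sketch: both loops stay pure, and the only IO is a single print of the finished list.

{-# LANGUAGE BangPatterns #-}

-- Same nested loops, but producing a pure [Int]; the one print at the
-- end is the only IO in the program.
perfect :: [Int]
perfect = outer 1
  where
    outer !i
      | i > 10000        = []
      | i == inner i 0 1 = i : outer (i + 1)
      | otherwise        = outer (i + 1)
    inner !i !s !j
      | j <= i - 1 = if i `rem` j == 0 then inner i (s + j) (j + 1)
                                       else inner i s (j + 1)
      | otherwise  = s

main :: IO ()
main = print perfect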

stefanor:
On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
Finally, we can manually translate the C code into a confusing set of nested loops with interleaved IO,
main = loop 1 where loop !i | i > 10000 = return () | otherwise = if i == go i 0 1 then print i >> loop (i+1) else loop (i+1)
go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1) else go i s (j+1) | otherwise = s
And we get *no speed benefit* at all!
time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
So the lesson is: write in a high level style, and the compiler can do the work for you. Or, GHC is pretty smart on high level code.
IO blocks unboxing in GHC. How fast is your mock-C code refactored to do IO outside of the loops only?
It doesn't! The above code yields:

Main.$wloop :: GHC.Prim.Int#
            -> GHC.Prim.State# GHC.Prim.RealWorld
            -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

$wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#

where

$s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#

On Sun, Oct 28, 2007 at 01:43:07PM -0700, Don Stewart wrote:
stefanor:
IO blocks unboxing in GHC. How fast is your mock-C code refactored to do IO outside of the loops only?
It doesn't! The above code yields:
Main.$wloop :: GHC.Prim.Int# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
$wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# where $s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
Ah, interesting. I was reading something too general from http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5. Stefan

On 10/28/07, Stefan O'Rear
On Sun, Oct 28, 2007 at 01:43:07PM -0700, Don Stewart wrote:
stefanor:
IO blocks unboxing in GHC. How fast is your mock-C code refactored to do IO outside of the loops only?
It doesn't! The above code yields:
Main.$wloop :: GHC.Prim.Int# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
$wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# where $s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
Ah, interesting. I was reading something too general from http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5.
Right, unboxing is successful here because the arguments are marked as strict with (!) annotations. The IO hack described in that bug report would only be a problem if the arguments were not marked strict, and were used in code that has to be executed after an IO call. Although I don't think all of the strictness annotations are even necessary here. In any case, loop must evaluate i before the call to print; there would only be an issue if it called an IO function before doing anything that demands i. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Accordingly, computer scientists commonly choose models which have bottoms, but prefer them topless." -- Davey & Priestley, _Introduction to Lattices and Order_
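As a point of comparison for the "not all of the strictness annotations are necessary" remark, the bang-free variant of the same loop would be the following sketch; whether GHC's strictness analysis recovers the same unboxed loop would need checking with -ddump-simpl.

-- i and j are always demanded by the comparisons, so only the
-- accumulator s really relies on strictness analysis to unbox.
main :: IO ()
main = loop 1
  where
    loop i
      | i > (10000 :: Int) = return ()
      | i == go i 0 1      = print i >> loop (i + 1)
      | otherwise          = loop (i + 1)
    go :: Int -> Int -> Int -> Int
    go i s j
      | j <= i - 1 = if i `rem` j == 0 then go i (s + j) (j + 1)
                                       else go i s (j + 1)
      | otherwise  = s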

dons:
stefanor:
On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
Finally, we can manually translate the C code into a confusing set of nested loops with interleaved IO,
main = loop 1 where loop !i | i > 10000 = return () | otherwise = if i == go i 0 1 then print i >> loop (i+1) else loop (i+1)
go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1) else go i s (j+1) | otherwise = s
And we get *no speed benefit* at all!
time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
So the lesson is: write in a high level style, and the compiler can do the work for you. Or, GHC is pretty smart on high level code.
Oh, and we can fuse with the print loop too, yielding an entire program of type Int# -> IO (), and really no intermediate lists (even for the return list). Again, we need to use the fusible version of mapM_:

import Prelude hiding (filter,sum,enumFromTo,mapM_,sequence_,map,foldr)
import Data.List.Stream
import Data.Stream (enumFromToInt,unstream)

enumFromTo i j = unstream (enumFromToInt i j)

mapM_ f as   = sequence_ (map f as)
sequence_ ms = foldr (>>) (return ()) ms   -- ^ fuse happily

default(Int)

main = mapM_ print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

divisors i = filter (\j -> i `rem` j == 0) (enumFromTo 1 (i-1))

And we see the map and foldr fuse in sequence, which in turn fuses with the filter (and the rest of the program):

18 RuleFired
    5 STREAM stream/unstream fusion
    2 filter -> fusible
    1 foldr -> fusible
    1 map -> fusible
    1 sumInt -> fusible

The program flattens to a single nested loop,

Main.$wloop_foldr :: GHC.Prim.Int#
                  -> GHC.Prim.State# GHC.Prim.RealWorld
                  -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

which really is equivalent in terms of control flow and intermediate structures to the C program.

-- Don

Don Stewart wrote:
default(Int)
divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1)) main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)
...
So almost identical types to the C program (bar for the return [Int]).
Finally, we can manually translate the C code into a confusing set of nested loops with interleaved IO,
how lazy is `print` supposed to be? If it's strict, we need to return / accumulate a list (and in this case, that is not very time-consuming because there are only four numbers in the list).

From http://haskell.org/onlinereport/standard-prelude.html

print x = putStrLn (show x)

show on lists is lazy, and it doesn't matter that it's not for Ints

putStrLn :: String -> IO ()
putStrLn s = do
  putStr s
  putStr "\n"

putStr :: String -> IO ()
putStr s = mapM_ putChar s

okay, mapM_ makes it lazy.

putChar :: Char -> IO ()
putChar = primPutChar

(It's easy to get confused with OS buffering effects too...)

so it should ALL be able to fuse, with no intermediate lists, I think (not sure about [Char] from show... or whether fusion works with IO sequencing...)

Isaac

One thing I've noticed is that turning on optimizations significantly increases the speed of Haskell code. Are you comparing code between languages with -O2 or without opts?
On 10/28/07, Prabhakar Ragde
For the purposes of learning, I am trying to optimize some variation of the following code for computing all perfect numbers less than 10000.
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious C program takes about .3 seconds, and a compiled OCaML program (tail recursion, no lists) about .33 seconds. The above takes about 4 seconds.
I've tried using foldl', and doing explicit tail recursion with strict accumulators, but I can't get the running time below 3 seconds. Is it possible to come within striking distance of the other languages? Thanks. --PR

Ryan Dickie wrote:
One thing I've noticed is that turning on optimizations significantly increases the speed of haskell code. Are you comparing code between languages with -O2 or without opts?
I had done no optimization, but neither -O nor -O2 makes a significant difference in either the C or Haskell programs. But using `rem` instead of `mod`, together with the type annotation, makes the two programs take pretty much the same amount of time. --PR
participants (16)
- ajb@spamcop.net
- Daniel Fischer
- Derek Elkins
- Don Stewart
- Dusan Kolar
- Isaac Dupree
- Jaak Randmets
- jerzy.karczmarczuk@info.unicaen.fr
- Peter Hercek
- Prabhakar Ragde
- Rodrigo Queiro
- Ryan Dickie
- Simon Marlow
- Stefan O'Rear
- Tillmann Rendel
- Tim Chevalier