
Hello haskell-cafe, since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell: main = print $ sum[1..10^9::Int] main = print $ sum0 (10^9) 0 sum0 :: Int -> Int -> Int sum0 0 !acc = acc sum0 !x !acc = sum0 (x-1) (acc+x) main() { int sum=0; //for(int j=0; j<100;j++) for(int i=0; i<1000*1000*1000;i++) sum += i; return sum; } execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Ahem. Seems like you've included time spent on the runtime loading. My results: MigMit:~ MigMit$ gcc -o test -O3 -funroll-loops test.c && time ./test -1243309312 real 0m0.066s user 0m0.063s sys 0m0.002s MigMit:~ MigMit$ rm test; ghc -O2 --make test.hs && time ./test Linking test ... -243309312 real 0m3.201s user 0m3.165s sys 0m0.017s While 3.201 vs. 0.066 seem to be a huge difference, 0.017 vs. 0.002 is not that bad. On 20 Feb 2009, at 16:29, Bulat Ziganshin wrote:
Hello haskell-cafe,
since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell:
main = print $ sum[1..10^9::Int]
main = print $ sum0 (10^9) 0
sum0 :: Int -> Int -> Int sum0 0 !acc = acc sum0 !x !acc = sum0 (x-1) (acc+x)
main() { int sum=0; //for(int j=0; j<100;j++) for(int i=0; i<1000*1000*1000;i++) sum += i; return sum; }
execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Forget it, my bad. On 20 Feb 2009, at 16:48, Miguel Mitrofanov wrote:
Ahem. Seems like you've included time spent on the runtime loading.
My results:
MigMit:~ MigMit$ gcc -o test -O3 -funroll-loops test.c && time ./test -1243309312 real 0m0.066s user 0m0.063s sys 0m0.002s MigMit:~ MigMit$ rm test; ghc -O2 --make test.hs && time ./test Linking test ... -243309312
real 0m3.201s user 0m3.165s sys 0m0.017s
While 3.201 vs. 0.066 seem to be a huge difference, 0.017 vs. 0.002 is not that bad.
On 20 Feb 2009, at 16:29, Bulat Ziganshin wrote:
Hello haskell-cafe,
since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell:
main = print $ sum[1..10^9::Int]
main = print $ sum0 (10^9) 0
sum0 :: Int -> Int -> Int sum0 0 !acc = acc sum0 !x !acc = sum0 (x-1) (acc+x)
main() { int sum=0; //for(int j=0; j<100;j++) for(int i=0; i<1000*1000*1000;i++) sum += i; return sum; }
execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hello Miguel, Friday, February 20, 2009, 4:48:15 PM, you wrote:
Ahem. Seems like you've included time spent on the runtime loading.
for C, i've used additional 100x loop
sys 0m0.002s sys 0m0.017s
While 3.201 vs. 0.066 seem to be a huge difference, 0.017 vs. 0.002 is not that bad.
are you know that "sys" time means? :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

---- Test.hs ---- import Prelude hiding (sum, enumFromTo) import Data.List.Stream (sum, unfoldr) enumFromTo m n = unfoldr f m where f k | k <= n = Just (k,k+1) | otherwise = Nothing main = print . sum $ enumFromTo 1 (10^9 :: Int) ---- snip ---- dolio@zeke % time ./Test 500000000500000000 ./Test 3.12s user 0.03s system 80% cpu 3.922 total dolio@zeke % time ./Test-sum0 500000000500000000 ./Test-sum0 3.47s user 0.02s system 80% cpu 4.348 total dolio@zeke % time ./Test-sum0 500000000500000000 ./Test-sum0 3.60s user 0.02s system 90% cpu 4.009 total dolio@zeke % time ./Test 500000000500000000 ./Test 3.11s user 0.02s system 81% cpu 3.846 total ---- snip ---- "Test-sum0" is with the sum0 function "Test" is the code at the top of this mail. -fvia-c -optc-O3 didn't seem to make a big difference with either Haskell example, so they're both with the default backend. Your C++ code runs slowly on my system (around 1 second), but that's because it uses 32-bit ints, I guess (switching to long int sped it up). -- Dan

Sorry for replying to myself, but I got suspicious about the 6ms runtime of the 64-bit C++ code on my machine. So I looked at the assembly and found this: .LCFI1: movabsq $499999999500000000, %rsi movl $_ZSt4cout, %edi pushq %r12 I'm no assembly guru, but that makes me think that there's no actual computation going on in the runtime for the 64-bit C++ program, whereas the 32-bit one is clearly doing work on my system, since it takes around 1 second. Not that I'd be sad if GHC could reduce that whole constant at compile time, but GCC isn't doing 1 billion adds in 6 (or even 60) milliseconds.

Hello Dan, Friday, February 20, 2009, 5:39:25 PM, you wrote:
Not that I'd be sad if GHC could reduce that whole constant at compile time, but GCC isn't doing 1 billion adds in 6 (or even 60) milliseconds.
yes, that's what was done actually: 22 0020 8D44D01C leal 28(%eax,%edx,8), %eax 23 0024 83C208 addl $8, %edx so, i rechecked with multiplies: mult.hs 12.667 mult-fast.hs 2.512 mult.cpp 0.938 and xors: mult.hs 12.605 mult-fast.hs 1.856 xor.cpp 0.339 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Feb 20, 2009 at 6:39 AM, Dan Doel
Sorry for replying to myself, but I got suspicious about the 6ms runtime of the 64-bit C++ code on my machine. So I looked at the assembly and found this:
.LCFI1: movabsq $499999999500000000, %rsi movl $_ZSt4cout, %edi pushq %r12
I'm no assembly guru, but that makes me think that there's no actual computation going on in the runtime for the 64-bit C++ program, whereas the 32-bit one is clearly doing work on my system, since it takes around 1 second.
Not that I'd be sad if GHC could reduce that whole constant at compile time, but GCC isn't doing 1 billion adds in 6 (or even 60) milliseconds.
The GCC optimizer must know that you can't return a value to user space of that large as a return result. In Haskell you're printing it... why not print it in C++?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Friday 20 February 2009 10:52:03 am David Leimbach wrote:
The GCC optimizer must know that you can't return a value to user space of that large as a return result.
In Haskell you're printing it... why not print it in C++?
I actually changed my local copy to print out the result (since I wanted to make sure it was using 64 bit ints). It didn't make a difference in the timing (of either the 32 or 64 bit version).

Hey guys, what about the LLVM bindings? They seem nice for tight loops this one. -- Felipe.

Bulat Ziganshin
execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
Nice! Now we know that gcc can calculate faster than Haskell can calculate and print. Next time, use exitWith, please. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

When I change the C++ program into:
int n;
scanf("%d", &n);
for(i=0; i
Bulat Ziganshin
wrote: execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
Nice! Now we know that gcc can calculate faster than Haskell can calculate and print. Next time, use exitWith, please.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

"Peter" == Peter Verswyvelen
writes:
Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC Peter> without loop unrolling, which is not too bad since GHC does Peter> not perform register optimization and loop unrolling yet Peter> no? I would call it rather poor. And I don't accept a since of that form as valid mitigation. -- Colin Adams Preston Lancashire

Well C# does it with a for loop in 2300ms, and when using a IEnumerable
sequence it needs 19936ms. Very much like the Haskell code. But of course
the Haskell code could optimize the sum I guess, I assume it is using the
lazy version of sum by default.
Anyway it was more of a question. Does GHC perform register allocation (e.g.
using graph colouring) and loop unrolling?
On Fri, Feb 20, 2009 at 4:22 PM, Colin Paul Adams
"Peter" == Peter Verswyvelen
writes: Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC Peter> without loop unrolling, which is not too bad since GHC does Peter> not perform register optimization and loop unrolling yet Peter> no?
I would call it rather poor.
And I don't accept a since of that form as valid mitigation. -- Colin Adams Preston Lancashire

Hello Peter, Friday, February 20, 2009, 6:34:04 PM, you wrote:
Well C# does it with a for loop in 2300ms, and when using a IEnumerable sequence it needs 19936ms. Very much like the Haskell code. But of course the Haskell code could optimize the sum I guess, I assume it is using the lazy version of sum by default.
the question is what is the natural for every language
Anyway it was more of a question. Does GHC perform register allocation (e.g. using graph colouring) and loop unrolling?
afaik, ghc can be compared with 20-years old C compilers. it uses registers for performing tight loops but has very simple register allocation procedure. also it doesn't unroll loops
On Fri, Feb 20, 2009 at 4:22 PM, Colin Paul Adams
wrote: "Peter" == Peter Verswyvelen
writes: Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC Peter> without loop unrolling, which is not too bad since GHC does Peter> not perform register optimization and loop unrolling yet Peter> no?
I would call it rather poor.
And I don't accept a since of that form as valid mitigation. -- Colin Adams Preston Lancashire
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
Hello Peter,
Friday, February 20, 2009, 6:34:04 PM, you wrote:
Well C# does it with a for loop in 2300ms, and when using a IEnumerable sequence it needs__19936ms. Very much like the Haskell code. But of course the Haskell code could optimize the sum I guess, I assume it is using the lazy version of sum by default.
the question is what is the natural for every language
Anyway it was more of a question.__Does GHC perform register allocation (e.g. using graph colouring) __and loop unrolling?
afaik, ghc can be compared with 20-years old C compilers. it uses registers for performing tight loops but has very simple register allocation procedure. also it doesn't unroll loops
hmmm... do we have magic-hash vector types and folds and maps on them? I'm only asking because gcc fails to use _anything_ but plain registers. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Concrete examples always help, thanks. Turning this into a ticket with associated test will: - enable others to find and repeat the test when this thread is long gone, to see whether any other ghc changes have helped in any way - enable documentation of what exactly the issue is (why is it slow?) - enable others to vote for having this issue addressed - help to keep the various performance issues separate (I seem to recall that you and others had found some other infelicities in ghc-generated code, and lots of other useful bits a pieces over the years, not all of which have been resolved or made into tickets?) Without ticket, such examples will be snowed under here in no time. With ticket, it will take a little longer!-)
afaik, ghc can be compared with 20-years old C compilers. it uses registers for performing tight loops but has very simple register allocation procedure. also it doesn't unroll loops
I've occasionally run into situations where it would have been nice
to have loop unrolling, or more generally, partial unfolding of recursive
functions. But I've got the feeling that this isn't the whole story here,
or is it?
In simple enough situations, one can roll one's own loop unrolling;),
somewhat like shown below (worker/wrapper split to bring the function
parameter representing the loop body into scope, then template haskell
to unroll its applications syntactically, then optimization by transformation
to get rid of the extra code). It is all rather more complicated than one
would like it to be, what with TH scoping restrictions and all, but perhaps
a library of self-unrolling loop combinators along these lines might help, as
a workaround until ghc does its own unrolling.
Claus
{-# LANGUAGE TemplateHaskell #-}
module Apply where
import Language.Haskell.TH.Syntax
apply i bound | i

Bulat Ziganshin
Hello Claus,
Friday, February 20, 2009, 11:15:59 PM, you wrote:
Turning this into a ticket with associated test will:
but why you think that this is untypical and needs a ticket? ;)
Bulat, you are right in every aspect. You never did anything wrong. Back when you debugged your code all night long, you were only dreaming. It ran perfectly from the very beginning. Your every utterance is Truth as well as every line of your code breathes the Tao. Therefore, we are thankful and never tell you about things that you didn't mess up, overlooked, or ignored and later on plainly forgot, as you just don't do such things. Notice something strange? -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Hello Achim, Friday, February 20, 2009, 11:44:49 PM, you wrote:
Turning this into a ticket with associated test will:
but why you think that this is untypical and needs a ticket? ;)
Bulat, you are right in every aspect. You never did anything wrong.
Achim, this is simplest code one can imagine. so when Simon will go to check ghc optimizations, he will try it without any reports. but Simon, unlike Don, never said that ghc may be compared to gcc. Don, on the other hand, say this everyday. when he is asked for code that shows this, he declined to answer. so - why YOU think that ghc generates fast code and this example is something unusual? can you provide any *technical* arguments or will continue to make personal attacks together with Don? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Achim,
Friday, February 20, 2009, 11:44:49 PM, you wrote:
Turning this into a ticket with associated test will:
but why you think that this is untypical and needs a ticket? ;)
Bulat, you are right in every aspect. You never did anything wrong.
Achim, this is simplest code one can imagine. so when Simon will go to check ghc optimizations, he will try it without any reports. but Simon, unlike Don, never said that ghc may be compared to gcc. Don, on the other hand, say this everyday. when he is asked for code that shows this, he declined to answer. so - why YOU think that ghc generates fast code and this example is something unusual? can you provide any *technical* arguments or will continue to make personal attacks together with Don?
Bulat, you misunderstand, it is not personal! We just want something to work on. Something specific. For example, you've identified loop unrolling as something that could very profitably be improved in GHC, and Claus even wrote a prototype to see what kind of speedups to guess. This is a great contribution! Now we know where to hunt. -- Don

dons:
bulat.ziganshin:
Hello Achim,
Friday, February 20, 2009, 11:44:49 PM, you wrote:
Turning this into a ticket with associated test will:
but why you think that this is untypical and needs a ticket? ;)
Bulat, you are right in every aspect. You never did anything wrong.
Achim, this is simplest code one can imagine. so when Simon will go to check ghc optimizations, he will try it without any reports. but Simon, unlike Don, never said that ghc may be compared to gcc. Don, on the other hand, say this everyday. when he is asked for code that shows this, he declined to answer. so - why YOU think that ghc generates fast code and this example is something unusual? can you provide any *technical* arguments or will continue to make personal attacks together with Don?
Bulat, you misunderstand, it is not personal! We just want something to work on. Something specific.
For example, you've identified loop unrolling as something that could very profitably be improved in GHC, and Claus even wrote a prototype to see what kind of speedups to guess.
This is a great contribution! Now we know where to hunt.
And just to summarise what we have seen: ghc -O2 naive left fold 15.680 gcc -O0 4.500 ghc manual recursion -fasm 1.328 ghc manual recursion 1.035 ghc naive left fold "stream fusion" 0.967 gcc -O1 0.892 ghc "-funroll-loops" -D8 0.623 gcc -O3 -funroll-loops 0.318 ghc "-funroll-loops" -D64 0.088 So what did we learn here? -- Don

Don Stewart ha scritto:
dons:
bulat.ziganshin:
Hello Achim,
Friday, February 20, 2009, 11:44:49 PM, you wrote:
Turning this into a ticket with associated test will: but why you think that this is untypical and needs a ticket? ;)
Bulat, you are right in every aspect. You never did anything wrong. Achim, this is simplest code one can imagine. so when Simon will go to check ghc optimizations, he will try it without any reports. but Simon, unlike Don, never said that ghc may be compared to gcc. Don, on the other hand, say this everyday. when he is asked for code that shows this, he declined to answer. so - why YOU think that ghc generates fast code and this example is something unusual? can you provide any *technical* arguments or will continue to make personal attacks together with Don? Bulat, you misunderstand, it is not personal! We just want something to work on. Something specific.
For example, you've identified loop unrolling as something that could very profitably be improved in GHC, and Claus even wrote a prototype to see what kind of speedups to guess.
This is a great contribution! Now we know where to hunt.
And just to summarise what we have seen:
ghc -O2 naive left fold 15.680 gcc -O0 4.500 ghc manual recursion -fasm 1.328 ghc manual recursion 1.035 ghc naive left fold "stream fusion" 0.967 gcc -O1 0.892 ghc "-funroll-loops" -D8 0.623 gcc -O3 -funroll-loops 0.318 ghc "-funroll-loops" -D64 0.088
As a full comparison I would like to see time for ghc -O0 naive left fold Manlio Perillo

Hello Manlio, Saturday, February 21, 2009, 12:54:00 AM, you wrote:
ghc -O2 naive left fold 15.680
As a full comparison I would like to see time for ghc -O0 naive left fold
he is still waiting :))) but that's really has only theoretical interest, comparing ghc -O2 on low-level haskell code to gcc -O0 has more meaning since ghc has no serious low-level optimizer so these results should be close -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Don, Saturday, February 21, 2009, 12:43:46 AM, you wrote:
gcc -O3 -funroll-loops 0.318 ghc "-funroll-loops" -D64 0.088
So what did we learn here?
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
Hello Don,
Saturday, February 21, 2009, 12:43:46 AM, you wrote:
gcc -O3 -funroll-loops 0.318 ghc "-funroll-loops" -D64 0.088
So what did we learn here?
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results
...that we have a path to get better results than gcc -O3 -funroll-loops, and it's within reach... we even can get there now, albeit not in the most hack-free way imaginable? -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Hello Achim, Saturday, February 21, 2009, 1:17:08 AM, you wrote:
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results
...that we have a path to get better results than gcc -O3 -funroll-loops, and it's within reach... we even can get there now, albeit not in the most hack-free way imaginable?
well, can this be made for C++? yes. moreover, gcc does this trick *automatically*, while with ghc we need to write 50-line program using Template Haskell and then run it through gcc - and finally get exactly the same optimization we got automatic for C code so, again: this confirms that Don is always build artificial comparisons, optimizing Haskell code by hand and ignoring obvious ways to optimize Haskell code. unfortunately, this doesn't work in real live. and even worse - Don reports this as fair Haskell vs C++ comparison -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 23:33, Bulat Ziganshin wrote:
Hello Achim,
Saturday, February 21, 2009, 1:17:08 AM, you wrote:
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results
...that we have a path to get better results than gcc -O3 -funroll-loops, and it's within reach... we even can get there now, albeit not in the most hack-free way imaginable?
well, can this be made for C++? yes. moreover, gcc does this trick *automatically*, while with ghc we need to write 50-line program using Template Haskell and then run it through gcc - and finally get exactly the same optimization we got automatic for C code
so, again: this confirms that Don is always build artificial comparisons, optimizing Haskell code by hand and ignoring obvious ways to optimize Haskell code. unfortunately, this doesn't work in real live. and even worse - Don reports this as fair Haskell vs C++ comparison
You need look no further than the debian language shootout that things really aren't as bad as you're making out – Haskell comes in in general less than 3x slower than gcc compiled C. Of note, of all the managed languages, this is about the fastest – none of the other languages that offer safety and garbage collection etc get as close as Haskell does. Bob

Hello Thomas, Saturday, February 21, 2009, 1:41:24 AM, you wrote:
You need look no further than the debian language shootout that things really aren't as bad as you're making out √ Haskell comes in in general less than 3x slower than gcc compiled C.
you should look inside these tests, as i done :) most of these tests depends on libraries speed. in one test, PHP is 1st. from 2 or 3 tests that depends on compiler speed, one was fooled by adding special function readInt to ghc libs and the rest are written in low-level haskell code - look the sources
Of note, of all the managed languages, this is about the fastest √ none of the other languages that offer safety and garbage collection etc get as close as Haskell does.
i'm sorry, but this test only shows amount of work spent to optimize these programs. results of some tests was made 10-100 times better, while improving real programs performance needs much more work :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 23:52, Bulat Ziganshin wrote:
Hello Thomas,
Saturday, February 21, 2009, 1:41:24 AM, you wrote:
You need look no further than the debian language shootout that things really aren't as bad as you're making out √ Haskell comes in in general less than 3x slower than gcc compiled C.
you should look inside these tests, as i done :)
most of these tests depends on libraries speed. in one test, PHP is 1st. from 2 or 3 tests that depends on compiler speed, one was fooled by adding special function readInt to ghc libs and the rest are written in low-level haskell code - look the sources
Shock news – benchmarks lead to compiler and library optimisations. News at 11!
Of note, of all the managed languages, this is about the fastest √ none of the other languages that offer safety and garbage collection etc get as close as Haskell does.
i'm sorry, but this test only shows amount of work spent to optimize these programs. results of some tests was made 10-100 times better, while improving real programs performance needs much more work :)
I don't get your point at all. Are you implying that none of the code for any of the other languages up there is optimized in any way? Bob

Hello Thomas, Saturday, February 21, 2009, 1:55:33 AM, you wrote:
most of these tests depends on libraries speed. in one test, PHP is 1st. from 2 or 3 tests that depends on compiler speed, one was fooled by adding special function readInt to ghc libs and the rest are written in low-level haskell code - look the sources
Shock news – benchmarks lead to compiler and library optimisations.
read carefully - it was not general-purpose optimization, Don just added function readInt since Haskell read is very slow. if he was optimized general read so many applications will benefit from this - it would be great but if you want to believe that ghc was optimised in the way that avery program becomes 50x faster - i can't stop you :D
i'm sorry, but this test only shows amount of work spent to optimize these programs. results of some tests was made 10-100 times better, while improving real programs performance needs much more work :)
I don't get your point at all. Are you implying that none of the code for any of the other languages up there is optimized in any way?
code for other languages usually written in idiomatic way. it was local Haskell epidemy. you may compare mandelbrot sources in C and Haskell -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

--- On Fri, 2/20/09, Bulat Ziganshin
You need look no further than the debian language shootout that things really aren't as bad as you're making out √ Haskell comes in in general less than 3x slower than gcc compiled C.
you should look inside these tests, as i done :)
When did you look - six months ago? a year ago? 3 years ago?
most of these tests depends on libraries speed
Most? 2 of 12 strongly depend on libraries because PCRE and GMP are explicitly allowed.
in one test, PHP is 1st
I don't believe that has ever been true - which test?
from 2 or 3 tests that depends on compiler speed, one was fooled by adding special function readInt to ghc libs and the rest are written in low-level haskell code - look the sources
Please note that "sum-file" is not included in the current tests. Please note that "sum-file" has not been included since autumn 2008.

Hello Isaac, Saturday, February 21, 2009, 3:28:31 AM, you wrote:
When did you look - six months ago? a year ago? 3 years ago?
ah, again this argument. two weeks ago Don said that ghc changed a lot in 2 years, now when we see that there is no difference, he says that those loop optimizer is somewhere noone know where. now i should look into new set of tests just because everyone else believe that ghc is shiny. please look yourself, i will continue to say about testset i have seen when 6.6 arrived
most of these tests depends on libraries speed Most? 2 of 12 strongly depend on libraries because PCRE and GMP are explicitly allowed.
*were* depending on libs speed. in particular, haskell's triumph - multithreading tests, chameneos or so
in one test, PHP is 1st I don't believe that has ever been true - which test?
large regexps one
Please note that "sum-file" has not been included since autumn 2008.
ok -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

--- On Fri, 2/20/09, Bulat Ziganshin
From: Bulat Ziganshin
Subject: Re[4]: [Haskell-cafe] Re: speed: ghc vs gcc To: "Isaac Gouy" Cc: haskell-cafe@haskell.org Date: Friday, February 20, 2009, 4:43 PM Hello Isaac, Saturday, February 21, 2009, 3:28:31 AM, you wrote:
When did you look - six months ago? a year ago? 3 years ago?
ah, again this argument. two weeks ago Don said that ghc changed a lot in 2 years, now when we see that there is no difference, he says that those loop optimizer is somewhere noone know where. now i should look into new set of tests just because everyone else believe that ghc is shiny. please look yourself, i will continue to say about testset i have seen when 6.6 arrived
If you're going to continue talking about a testset you saw in 2006 then tell people you are talking about 2006.
most of these tests depends on libraries speed Most? 2 of 12 strongly depend on libraries because PCRE and GMP are explicitly allowed.
*were* depending on libs speed. in particular, haskell's triumph - multithreading tests, chameneos or so
Most? If you add those 2, that makes 4 out of 12 (4 out of 17 in the old data).
in one test, PHP is 1st I don't believe that has ever been true - which test?
large regexps one
PHP is not 1st in regex-dna. PHP is not even 1st in regex-dna in the old data. PHP is not even in the first 15 in regex-dna in the old data.

Hello Thomas, Saturday, February 21, 2009, 1:41:24 AM, you wrote:
so, again: this confirms that Don is always build artificial comparisons, optimizing Haskell code by hand and ignoring obvious ways
You need look no further than the debian language shootout that things
and yes - this is the most well-known example of fooled comparison. just ask Don how much time haskell community spent rewriting this code -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Thomas Davie wrote:
You need look no further than the debian language shootout that things really aren't as bad as you're making out – Haskell comes in in general less than 3x slower than gcc compiled C.
Of note, of all the managed languages, this is about the fastest – none of the other languages that offer safety and garbage collection etc get as close as Haskell does.
Bob
OCaml and Clean seems to be pretty fast too.

On 21 Feb 2009, at 00:10, Ahn, Ki Yung wrote:
Thomas Davie wrote:
You need look no further than the debian language shootout that things really aren't as bad as you're making out – Haskell comes in in general less than 3x slower than gcc compiled C. Of note, of all the managed languages, this is about the fastest – none of the other languages that offer safety and garbage collection etc get as close as Haskell does. Bob
OCaml and Clean seems to be pretty fast too.
Very true :). As does C#, but using MS's compiler not mono. I think my conclusion from this thread is "stop arguing, someone being wrong on the internet is not worth it", oh and "cool, new possibly major optimisation for ghc". And finally, something I'd known all along – Haskell is plenty fast enough for writing real world programs, it's not as fast as C, but I don't care – I write so much better code so much faster in it that the tradeoff becomes worth it. Sorry for getting into the slagging match so much, and count me out of this one from now on. Bob

bulat.ziganshin:
Hello Achim,
Saturday, February 21, 2009, 1:17:08 AM, you wrote:
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results
...that we have a path to get better results than gcc -O3 -funroll-loops, and it's within reach... we even can get there now, albeit not in the most hack-free way imaginable?
well, can this be made for C++? yes. moreover, gcc does this trick *automatically*, while with ghc we need to write 50-line program using Template Haskell and then run it through gcc - and finally get exactly the same optimization we got automatic for C code
so, again: this confirms that Don is always build artificial comparisons, optimizing Haskell code by hand and ignoring obvious ways to optimize Haskell code. unfortunately, this doesn't work in real live. and even worse - Don reports this as fair Haskell vs C++ comparison
This is extremely depressing to read after the good results and lessons of this thread. -- Don

Hello Don, Saturday, February 21, 2009, 1:55:19 AM, you wrote:
This is extremely depressing to read after the good results and lessons of this thread.
you misunderstand, it is not personal! We just want something to sarcasm on. Something specific. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Feb 20, 2009 at 10:33 PM, Bulat Ziganshin wrote: Hello Achim, Saturday, February 21, 2009, 1:17:08 AM, you wrote: nothing new: what you are not interested in real compilers comparison,
preferring to demonstrate artificial results ...that we have a path to get better results than gcc -O3
-funroll-loops, and it's within reach... we even can get there now,
albeit not in the most hack-free way imaginable? well, can this be made for C++? yes. moreover, gcc does this trick
*automatically*, while with ghc we need to write 50-line program using
Template Haskell and then run it through gcc - and finally get exactly
the same optimization we got automatic for C code so, again: this confirms that Don is always build artificial
comparisons, optimizing Haskell code by hand and ignoring obvious ways
to optimize Haskell code. unfortunately, this doesn't work in real
live. and even worse - Don reports this as fair Haskell vs C++
comparison Bulat, please, you're missing the point. Nobody is saying that the
template-haskell trick was somehow a viable general strategy right now that
everyone should use by default. It was used as a proof-of-concept that a
simple technique can lead to massive performance improvements - and we get
numbers for how massive it would be (beating gcc for this benchmark).
This isn't about "faking" a benchmark, it's about investigating the reasons
for why the benchmark looks they way it does, doing testing to verify the
assumptions (in this case using TH), and making constructive suggestions
(add loop-unrolling to the compiler). This investigation tells us that in
this case a compiler could beat gcc, if only it were to do loop unrolling in
the way the TH code does. That's a result!
I would ask you to note the simple fact that every single constructive
message in this thread has come from people other than you. I hope this
leads you reconsider your tone and general approach in the future. Haskell
people in general are always pretty good at accepting criticism IME (they
tend to want to fix the problem), don't you think it's odd that it's only
*your* criticism that gets so much flak? Maybe some part of the reason
almost every discussion you're in here usually ends up hostile is *your*
approach?
Regards,
--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862

Hello Sebastian, Saturday, February 21, 2009, 2:42:33 AM, you wrote:
Bulat, please, you're missing the point.
actually you are missing the point. i mirror Don's "non-attacking" style of comments on my person. are you mentioned those Don letter? sure - no
Nobody is saying that the template-haskell trick was somehow a viable general strategy right now that everyone should use by default. It was used as a proof-of-concept that a simple technique can lead to massive performance improvements - and we get numbers for how massive it would be (beating gcc for this benchmark).
sorry, but you was fooled too. the situation was the following: i wrote non-optimal code for 64-bit platforms (using 32-bit int) and Don don't corrected it. then he compiled TH-generated code via *gcc* that used "fusion" technique - the same that was used by 32-bit C++ code are you wondered why -D64 version is 8 times faster than -D8 one? it's exactly because *gcc* reduced 64 additions to just one operation. so this "fair" comparison used TH+gcc to generate faster code than gcc with improper data type definitions. if Don will fix C++ program, he will find that it's speed reduced in the same proportion - without TH tricks
This isn't about "faking" a benchmark, it's about investigating the reasons for why the benchmark looks they way it does, doing testing to verify the assumptions (in this case using TH), and making constructive suggestions (add loop-unrolling to the compiler). This investigation tells us that in this case a compiler could beat gcc, if only it were to do loop unrolling in the way the TH code does. That's a result!
yes, in the cases when *gcc* "fuse" loops and you don't allow it do it for C++ code but allows for Haskell - you will win
I would ask you to note the simple fact that every single constructive message in this thread has come from people other than you.
you are ignore, though, the fact that every destructive message in this thread comes against me. it seems that it's a crime here to write about ghc speed anything but praise. in best case people will said that these tests are destructive :lol:
I hope this leads you reconsider your tone and general approach in the future. Haskell people in general are always pretty good at accepting criticism IME (they tend to want to fix the problem),
that criticism??? cows can't fly, and ghc cannot beat gcc in 2 months. that bothers me is people that attack me just for comparing compilers head-to-head -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, Feb 21, 2009 at 12:16 AM, Bulat Ziganshin wrote: Hello Sebastian, Saturday, February 21, 2009, 2:42:33 AM, you wrote: Bulat, please, you're missing the point. actually you are missing the point. i mirror Don's
"non-attacking" style of comments on my person. are you mentioned
those Don letter? sure - no Nobody is saying that the
template-haskell trick was somehow a viable general strategy right
now that everyone should use by default. It was used as a
proof-of-concept that a simple technique can lead to massive
performance improvements - and we get numbers for how massive it
would be (beating gcc for this benchmark). sorry, but you was fooled too. the situation was the following: i
wrote non-optimal code for 64-bit platforms (using 32-bit int) and Don
don't corrected it. then he compiled TH-generated code via *gcc* that
used "fusion" technique - the same that was used by 32-bit C++ code are you wondered why -D64 version is 8 times faster than -D8 one? it's
exactly because *gcc* reduced 64 additions to just one operation. so
this "fair" comparison used TH+gcc to generate faster code than gcc
with improper data type definitions. if Don will fix C++ program, he
will find that it's speed reduced in the same proportion - without TH
tricks This isn't about "faking" a benchmark, it's about investigating the
reasons for why the benchmark looks they way it does, doing testing
to verify the assumptions (in this case using TH), and making
constructive suggestions (add loop-unrolling to the compiler). This
investigation tells us that in this case a compiler could beat gcc,
if only it were to do loop unrolling in the way the TH code does. That's a result! yes, in the cases when *gcc* "fuse" loops and you don't allow it do it
for C++ code but allows for Haskell - you will win I would ask you to note the simple fact that every single
constructive message in this thread has come from people other than
you. you are ignore, though, the fact that every destructive message in
this thread comes against me. it seems that it's a crime here to write
about ghc speed anything but praise. in best case people will said
that these tests are destructive :lol: I hope this leads you reconsider your tone and general approach
in the future. Haskell people in general are always pretty good at
accepting criticism IME (they tend to want to fix the problem), that criticism??? cows can't fly, and ghc cannot beat gcc in 2
months. that bothers me is people that attack me just for comparing
compilers head-to-head I'm not going to debate all these points with you because I don't think you
actually responded to mine, but let me just say that MY impression of this
thread is that people attack you not because you compare compilers
head-to-head, but because you do it in an incredibly abrasive and hostile
manner (your messages read much more like "Haha! I told you so, look how
stupid/dishonest you are!", than "Here's a case where GHC produces bad code,
here's some analysis, and here's a ticket/patch for it").
Just because you put a smiley at the end of a thinly veiled ad hominem
doesn't mean you get to pretend that you're just a victim when people get
understandably ticked off at your tone and respond in kind.
Search the archives, performance discussions come up all the time, often
with quite vigorous criticism of GHC's current results, but somehow those
threads manage to stay civil and on point. Please, do a little introspection
and see if you can stick to a more constructive and friendly tone in the
future - I would be willing to bet that if you did, you wouldn't get
"attacked".
--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862

I was intending to send this privately but clicked the wrong button. Apologies for adding even more noise to this discussion. On Sat, Feb 21, 2009 at 12:47 AM, Sebastian Sylvan < sylvan@student.chalmers.se> wrote:
On Sat, Feb 21, 2009 at 12:16 AM, Bulat Ziganshin < bulat.ziganshin@gmail.com> wrote:
Hello Sebastian,
Saturday, February 21, 2009, 2:42:33 AM, you wrote:
Bulat, please, you're missing the point.
actually you are missing the point. i mirror Don's "non-attacking" style of comments on my person. are you mentioned those Don letter? sure - no
Nobody is saying that the template-haskell trick was somehow a viable general strategy right now that everyone should use by default. It was used as a proof-of-concept that a simple technique can lead to massive performance improvements - and we get numbers for how massive it would be (beating gcc for this benchmark).
sorry, but you was fooled too. the situation was the following: i wrote non-optimal code for 64-bit platforms (using 32-bit int) and Don don't corrected it. then he compiled TH-generated code via *gcc* that used "fusion" technique - the same that was used by 32-bit C++ code
are you wondered why -D64 version is 8 times faster than -D8 one? it's exactly because *gcc* reduced 64 additions to just one operation. so this "fair" comparison used TH+gcc to generate faster code than gcc with improper data type definitions. if Don will fix C++ program, he will find that it's speed reduced in the same proportion - without TH tricks
This isn't about "faking" a benchmark, it's about investigating the reasons for why the benchmark looks they way it does, doing testing to verify the assumptions (in this case using TH), and making constructive suggestions (add loop-unrolling to the compiler). This investigation tells us that in this case a compiler could beat gcc, if only it were to do loop unrolling in the way the TH code does. That's
a result!
yes, in the cases when *gcc* "fuse" loops and you don't allow it do it for C++ code but allows for Haskell - you will win
I would ask you to note the simple fact that every single constructive message in this thread has come from people other than you.
you are ignore, though, the fact that every destructive message in this thread comes against me. it seems that it's a crime here to write about ghc speed anything but praise. in best case people will said that these tests are destructive :lol:
I hope this leads you reconsider your tone and general approach in the future. Haskell people in general are always pretty good at accepting criticism IME (they tend to want to fix the problem),
that criticism??? cows can't fly, and ghc cannot beat gcc in 2 months. that bothers me is people that attack me just for comparing compilers head-to-head
I'm not going to debate all these points with you because I don't think you actually responded to mine, but let me just say that MY impression of this thread is that people attack you not because you compare compilers head-to-head, but because you do it in an incredibly abrasive and hostile manner (your messages read much more like "Haha! I told you so, look how stupid/dishonest you are!", than "Here's a case where GHC produces bad code, here's some analysis, and here's a ticket/patch for it"). Just because you put a smiley at the end of a thinly veiled ad hominem doesn't mean you get to pretend that you're just a victim when people get understandably ticked off at your tone and respond in kind.
Search the archives, performance discussions come up all the time, often with quite vigorous criticism of GHC's current results, but somehow those threads manage to stay civil and on point. Please, do a little introspection and see if you can stick to a more constructive and friendly tone in the future - I would be willing to bet that if you did, you wouldn't get "attacked".
-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862
-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862

On 20 Feb 2009, at 22:57, Bulat Ziganshin wrote:
Hello Don,
Saturday, February 21, 2009, 12:43:46 AM, you wrote:
gcc -O3 -funroll-loops 0.318 ghc "-funroll-loops" -D64 0.088
So what did we learn here?
nothing new: what you are not interested in real compilers comparison, preferring to demonstrate artificial results
I'm not sure what you're getting at Bulat – it's been demonstrated that ghc is slower than gcc for most cases at the moment (many benchmarks will back this up), *however*, it's also easily verified that ghc has had significantly less effort directed at it than gcc and other imperative compilers, thus, there are many places it can improve greatly. In this case, you've pointed out a really great source of heavy optimisation. Thanks a lot :) Now perhaps it might be an idea to be constructive, rather than trying to stand like nelson going "HA HA" at the people with the inferior compiler. ;) Bob

Hello Thomas, Saturday, February 21, 2009, 1:19:47 AM, you wrote:
I'm not sure what you're getting at Bulat √ it's been demonstrated that ghc is slower than gcc for most cases at the moment (many benchmarks will back this up), *however*, it's also easily verified that ghc has had significantly less effort directed at it than gcc and other imperative compilers, thus, there are many places it can improve greatly.
of course. what fool will say that ghc cannot be optimized the same way as gcc? if we spent the same amount of time for improving ghc back-end as was spent for gcc (tens or hundreds man-years?), then *low-level* Haskell code will become as fast as C one, while remaining several times slower to write
In this case, you've pointed out a really great source of heavy optimisation. Thanks a lot :) Now perhaps it might be an idea to be constructive, rather than trying to stand like nelson going "HA HA" at the people with the inferior compiler.
ghc is superior compiler and it's my main instrument. but it can't make coffee and doesn't contain sophisticated code generator. it's why i dissuade from writing video codes in haskell and i don't like situation when someone too lazy to test speed yourself tell us tales and attack me when i say about real situation -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 23:41, Bulat Ziganshin wrote:
Hello Thomas,
Saturday, February 21, 2009, 1:19:47 AM, you wrote:
I'm not sure what you're getting at Bulat √ it's been demonstrated that ghc is slower than gcc for most cases at the moment (many benchmarks will back this up), *however*, it's also easily verified that ghc has had significantly less effort directed at it than gcc and other imperative compilers, thus, there are many places it can improve greatly.
of course. what fool will say that ghc cannot be optimized the same way as gcc? if we spent the same amount of time for improving ghc back-end as was spent for gcc (tens or hundreds man-years?), then *low-level* Haskell code will become as fast as C one, while remaining several times slower to write
Considering Haskell compilers have lots more guarenteed conditions to go on (like referential transparency etc), I'd imagine actually that given the same amount of effort, they could compile Haskell code to *much* faster code than C.
In this case, you've pointed out a really great source of heavy optimisation. Thanks a lot :) Now perhaps it might be an idea to be constructive, rather than trying to stand like nelson going "HA HA" at the people with the inferior compiler.
ghc is superior compiler and it's my main instrument. but it can't make coffee and doesn't contain sophisticated code generator. it's why i dissuade from writing video codes in haskell and i don't like situation when someone too lazy to test speed yourself tell us tales and attack me when i say about real situation
I'd hardly say that dons is too lazy – he has after all contributed rather large chunks of code to coming up with good examples, and optimising ghc. Secondly, I don't see him telling tales either – he's being very honest about the performance of Haskell here, and how it might be improved. Finally, I'd hardly call computing a constant in an arbitrarily complex way a "real situation". I think someone needs to get off their high horse and reflect a little. Bob

On Fri, Feb 20, 2009 at 11:51:43PM +0100, Thomas Davie wrote:
of course. what fool will say that ghc cannot be optimized the same way as gcc? if we spent the same amount of time for improving ghc back-end as was spent for gcc (tens or hundreds man-years?), then *low-level* Haskell code will become as fast as C one, while remaining several times slower to write
Considering Haskell compilers have lots more guarenteed conditions to go on (like referential transparency etc), I'd imagine actually that given the same amount of effort, they could compile Haskell code to *much* faster code than C.
Indeed. This is my thesis and jhc is my constructive proof (in progress) of said thesis. :) John -- John Meacham - ⑆repetae.net⑆john⑈

2009/2/20 Bulat Ziganshin
Hello Thomas,
Saturday, February 21, 2009, 1:19:47 AM, you wrote:
I'm not sure what you're getting at Bulat √ it's been demonstrated that ghc is slower than gcc for most cases at the moment (many benchmarks will back this up), *however*, it's also easily verified that ghc has had significantly less effort directed at it than gcc and other imperative compilers, thus, there are many places it can improve greatly.
of course. what fool will say that ghc cannot be optimized the same way as gcc? if we spent the same amount of time for improving ghc back-end as was spent for gcc (tens or hundreds man-years?), then *low-level* Haskell code will become as fast as C one, while remaining several times slower to write
In this case, you've pointed out a really great source of heavy optimisation. Thanks a lot :) Now perhaps it might be an idea to be constructive, rather than trying to stand like nelson going "HA HA" at the people with the inferior compiler.
ghc is superior compiler and it's my main instrument. but it can't make coffee and doesn't contain sophisticated code generator. it's why i dissuade from writing video codes in haskell and i don't like situation when someone too lazy to test speed yourself tell us tales and attack me when i say about real situation
Please people, I found the ghc/ghc+D64/jhc/gcc/ comparison awesome to read; but find quite distracting to have all those 'other' comments. I guess everyone knows quite clearly what others have in mind and where they stand. Maybe it is not necessary to repeat those things every time ? And more specifically, is it necessary to get some inflamatory tone ? Thanks to share knowledge and haskell love ! Thu

nothing should stop you from writing video games in Haskell since the
control logic of many video games is written in e.g. a scripting language
like LUA :-)
sure if you want to write a physics engine in Haskell, that's something
else.
but I've worked with people that wrote physics engines in C/C++, and they
also had to hand optimize specifically for a certain compiler to get things
fast.
so some manual tweaking to Haskell could be acceptable if you want to get
high performance. and I hope that in the future the Haskell compilers will
optimize more and more and more...
2009/2/20 Bulat Ziganshin
Hello Thomas,
Saturday, February 21, 2009, 1:19:47 AM, you wrote:
I'm not sure what you're getting at Bulat √ it's been demonstrated that ghc is slower than gcc for most cases at the moment (many benchmarks will back this up), *however*, it's also easily verified that ghc has had significantly less effort directed at it than gcc and other imperative compilers, thus, there are many places it can improve greatly.
of course. what fool will say that ghc cannot be optimized the same way as gcc? if we spent the same amount of time for improving ghc back-end as was spent for gcc (tens or hundreds man-years?), then *low-level* Haskell code will become as fast as C one, while remaining several times slower to write
In this case, you've pointed out a really great source of heavy optimisation. Thanks a lot :) Now perhaps it might be an idea to be constructive, rather than trying to stand like nelson going "HA HA" at the people with the inferior compiler.
ghc is superior compiler and it's my main instrument. but it can't make coffee and doesn't contain sophisticated code generator. it's why i dissuade from writing video codes in haskell and i don't like situation when someone too lazy to test speed yourself tell us tales and attack me when i say about real situation
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hello Peter, Saturday, February 21, 2009, 2:36:15 AM, you wrote:
nothing should stop you from writing video games in Haskell since
video codec isn't video game :)))
but I've worked with people that wrote physics engines in C/C++, and they also had to hand optimize specifically for a certain compiler to get things fast.
that's important signal. if you need to hand-optimize your code even if you use icl to compile it, using haskell will be like hunting with a hand instead of gun -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Now perhaps I'll be stepping into some lines of fire as it seems like this thread is full of them. If I get in anyone's way please kindly hold your shot ;-) That said, video codecs are the kinds of things that usually benefit greatly from vectorization and parallelization right? These are two areas that have been getting concentration recently. I'm not really familiar with all the codecs involved, but it would probably be a great test case if someone could write a video codec (perhaps not H.264 since I recall someone saying it was ridiculously complicated) in C/C++ and in Haskell using all the DPH/parallelization tricks, as a comparison benchmark to improve the performance of the compiled code coming out of GHC. Having two pieces of code that are decently optimized and should do the same thing seems like it would make finding snags in the GHC performance and fixing them that much easier. Also, hunting with your bare hands rather than with a gun is provably more bad-ass ;-) -Ross On Feb 20, 2009, at 6:52 PM, Bulat Ziganshin wrote:
Hello Peter,
Saturday, February 21, 2009, 2:36:15 AM, you wrote:
nothing should stop you from writing video games in Haskell since
video codec isn't video game :)))
but I've worked with people that wrote physics engines in C/C++, and they also had to hand optimize specifically for a certain compiler to get things fast.
that's important signal. if you need to hand-optimize your code even if you use icl to compile it, using haskell will be like hunting with a hand instead of gun
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

nothing should stop you from writing video games in Haskell since
video codec isn't video game :)))
ouch, mea culpa, I misread your message.
but I've worked with people that wrote physics engines in C/C++, and they also had to hand optimize specifically for a certain compiler to get things fast.
that's important signal. if you need to hand-optimize your code even if you use icl to compile it, using haskell will be like hunting with a hand instead of gun
well in the past I wrote a couple of video codecs (one was a multi core motion JPEG decoder), and I had to write parts of it in assembler to get fast enough frame rates. but that is already more than 10 years ago, today I don't think I could beat a C/C++ compiler like ICL anymore, at least not in a reasonable time frame

Peter Verswyvelen
nothing should stop you from writing video games in Haskell
Show me a studio that uses Haskell and I'd even accept dollars or pounds as payment. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Most people in the games industry that I knew don't even know haskell. they
are trained imperative hackers.
However tim sweeney studies haskell, so it cetainly has influenced at least
one well known game developer.
But I wasn't saying that Haskell *is* used, I said one could use it for
programming the logic of a game.
And I'm not saying a AAA nexgen title :-)
On Sat, Feb 21, 2009 at 1:10 AM, Achim Schneider
Peter Verswyvelen
wrote: nothing should stop you from writing video games in Haskell
Show me a studio that uses Haskell and I'd even accept dollars or pounds as payment.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Bulat Ziganshin
so - why YOU think that ghc generates fast code and this example is something unusual?
I think ghc has decent performance, and that there's room for improvement. I don't care whether you compare it to gcc, malbolge, or hand-written assembly, what matters isn't whether something is unusual, it is that look below vvvvvvvvvvvvvvvvvvvvvvvvvv if there's any way to have faster code than you get with ghc, right now, it's worth a bug report, even if it's going to be tagged as Milestone: ghc 120.10. Non-tracked issues are non-issues. ^^^^^^^^^^^^^^^^^^^^^^^^^^ look above Disk space is cheap, nowadays, and we can't have Simon's backlog run empty. Anything could happen. BTW: If you think I was being sarcastic in my last post, you are mistaken. The message is much simpler. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Hello Achim, Saturday, February 21, 2009, 12:54:33 AM, you wrote:
so - why YOU think that ghc generates fast code and this example is something unusual?
I think ghc has decent performance, and that there's room for improvement. I don't care whether you compare it to gcc,
i'm asking specifically about ghc vs gcc performance. what i'm said is that ghc generates slow code compared to gcc, if you don't agree with me - why you think opposite? are these reasons technical, i.e. some numbers or not?
if there's any way to have faster code than you get with ghc, right now, it's worth a bug report, even if it's going to be tagged as Milestone: ghc 120.10. Non-tracked issues are non-issues.
again, this looks strange. does you mean that every program you analyzed on assembler level has perfect code? or that you just don't need any more speed? i personally checked ghc performance 3 years ago and wrote decent s/w those days. i think that library i wrote was the first with hard low-level optimization and may be becomes an inspiration for ByteString optimizations nevertheless, ghc cannot generate really fast code and when i need speed, i use C++. my archiver is known as world fastest one and it's written in C++ and Haskell combination - C++ where we need speed, Haskell for the rest: i'm sure that it's the best combination until ghc 120.1 next, imagine that you live 20 years ago. you wrote in C and finds that gcc generates slower code than hand-written assembler. you report it - are you think that next day gcc will become as fast as assembler? gcc, like other best C++ compilers, spend many man-years before it got the current speed. ghc back-end developed by just Simon Marlow, and he cannot write gcc-like backend in less than 120 years, with reports or without next, the program i wrote is very primitive. are you really think that Simon can't build a lot of such examples without my help? it's why i consider your idea meaningless and more like an personal attack than real concern about ghc actually, me, like probably you, are not interested so much in better ghc optimizing backend. it's enough fast for my tasks, i use C++ for speed-critical code. i prefer that Simon will improve other sides of backend but problem - not mine, but for haskellers, is that some people said that ghc can generate code that is as fast as gcc one. it will be stupid if someone will start to write say mpeg4 codec and after year of work will find that it need 100 Ghz cpu to work. it's why i made this test, which you are trying to fool now. people that will believe your "arguments" instead of checking numbers may spend a lot of time meaningless. but it's doesn't matter for you, fanatics -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
but problem - not mine, but for haskellers, is that some people said that ghc can generate code that is as fast as gcc one. it will be stupid if someone will start to write say mpeg4 codec and after year of work will find that it need 100 Ghz cpu to work. it's why i made this test, which you are trying to fool now. people that will believe your "arguments" instead of checking numbers may spend a lot of time meaningless. but it's doesn't matter for you, fanatics
/me takes a breath Well, _if_ I was going to write a mpeg4 codec, I'd be starting of in Haskell, anyway, to get to terms with the algorithm. After being satisfied, in terms of quality and big-O, I'd start investigating how to do it as fast as possible. Most likely, that would include serious amounts of C. In any case, I could point out non-perfect optimisation to the ghc devs. Out of those devs, someone will know why the current optimisations don't work for that case. As soon as the devs know why it doesn't work, chances are that someone else does, or, at least, is curious enough. I certainly don't know jack about the latter topics, I only know that _my_ code doesn't work as it would, were the world perfect. From that, I can infer with certainty that I don't know jack about how long it would take the ghc devs to enable me to replace C with the original Haskell. I only know that, most likely, I couldn't do it faster. Therefore, I just take the chance, open a ticket, and see what happens. After all, if I _don't_ open a ticket, chances that it gets fixed are lower, if not non-existent. Doing all this, I couldn't care less about what you or Don say. One says that you can't have fast Haskell, the other one says that you can, the first one says that yes, in that case, but not in that, the second continues saying well but if you do that, then the first one goes but that doesn't count... I don't care: Both are right, from their own perspectives. If you make predictions on how people are going to interpret something someone says, you're bound to be mistaken. You won't make me believe that ghc can't produce fast code, and Don won't convince me that I will be able to, in every case. What'll make me happy, right now, isn't random test cases, but a test framework that lets us all reproduce each other's experiments in a controlled -- and thus numerically comparable without discussion -- environment. The current state of things leaks information, and isn't able to catch possible regressions, at all. Still, I could not care less about ghc's performance, right now: I'm much more concerned with high-level stuff, right now. What I _do_ know, is that all the results, collected in this thread, won't mean a thing the instant we get another ghc release, and that noone will be able to update the results without wading through flames, again. So, here's my advice: If you care about performance and best-possible communication of its state to the community, set up a page, in the spirit of the language shootout, that compares haskell compilers vs. chosen c and fortran compilers, in a variety of cases. Did I mention that we need automated benchmark comparisons? -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Bulat Ziganshin ha scritto:
[...] but problem - not mine, but for haskellers, is that some people said that ghc can generate code that is as fast as gcc one. it will be stupid if someone will start to write say mpeg4 codec and after year of work will find that it need 100 Ghz cpu to work.
IMHO, a more viable solution is to write the *parser* in GHC, but use the raw code written in C. As an example, I get some problems from mplayer, not caused by the codec, but by the demuxer and stream parsing. Manlio Perillo

On Fri, Feb 20, 2009 at 12:44 PM, Achim Schneider
Bulat, you are right in every aspect. You never did anything wrong. Back when you debugged your code all night long, you were only dreaming.
Achim, this doesn't seem like a constructive way to respond. Bulat's already been taken to task repeatedly for the manner in which he carries on, but being sarcastic in response doesn't help anyone, and serves only to lower the tone of the mailing list. Please keep things polite. Thanks, Bryan.

"Bryan O'Sullivan"
On Fri, Feb 20, 2009 at 12:44 PM, Achim Schneider
wrote: Bulat, you are right in every aspect. You never did anything wrong. Back when you debugged your code all night long, you were only dreaming.
Achim, this doesn't seem like a constructive way to respond. Bulat's already been taken to task repeatedly for the manner in which he carries on, but being sarcastic in response doesn't help anyone, and serves only to lower the tone of the mailing list.
Please keep things polite.
I don't do sarcasm, and I don't attack persons. I was merely pointing out the impressions his bug-reports give to the devs (there isn't anything wrong), and asked him whether there's something strange with that notion. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Turning this into a ticket with associated test will:
but why you think that this is untypical and needs a ticket? ;)
Because generally ghc is doing a good-enough job. And it is doing that because long ago, ghc hq's war cry was "if ghc generates code that is slower than any other haskell implementation, it is a bug, and you should file a ticket". The failure of "avoid success at all cost" has meant shifting priorities, and those priorities tend to imply that alternative implementations can't keep up (not over the full range of ghc's offerings). So there is less "internal" competition, more "other stuff" on todo lists, and good enough just has to be good enough most of the time. But when it isn't, then not just people at ghc hq are still listening. And among all the other good stuff they keep adding, they also keep tweaking ghc's basics, in such a way that what would be hard to do now, may be easier to implement in future ghc versions. And for that, they need input about what they should focus on. And the advertised way of providing that input is the ticket tracker. For tickets, small examples are great - they often highlight aspects that are still present in real code, but are hard to see there (let alone reducing complex code to small test cases that help to fix those cases). In the particular example of unrolling, iirc, the issue was that ghc's internal representation does not easily lend itself to adding some counter that would keep the very modular optimizer from applying recursive unfoldings uselessly. Once that fundamental problem is fixed, I am optimistic that this useful optimization will be looked at again - if there is a ticket to remind everyone. And if -as I suspect- lots of ghc users find themselves doing loop unrolling/partial recursion unfoldings by hand (worker/wrapper for recursive functions is just such an example), they will up the priority on that ticket. You're right that ghc hq can't be everywhere, but they aren't the only ones who are invited to look at those tickets. And everytime someone starts on ghc hacking for some unrelated purpose, they need a small project to start on - well-defined tickets have a chance there. I'm not at all happy with Haskell optimists turning evangelists, but neither is it productive for Haskell pessimists to spread their frustration. There are useful ways for both sides to contribute to the Haskell world, can we focus on those ways? Claus

claus.reinke:
Concrete examples always help, thanks.
In simple enough situations, one can roll one's own loop unrolling;), somewhat like shown below (worker/wrapper split to bring the function parameter representing the loop body into scope, then template haskell to unroll its applications syntactically, then optimization by transformation to get rid of the extra code). It is all rather more complicated than one would like it to be, what with TH scoping restrictions and all, but perhaps a library of self-unrolling loop combinators along these lines might help, as a workaround until ghc does its own unrolling.
Claus
{-# LANGUAGE TemplateHaskell #-} module Apply where import Language.Haskell.TH.Syntax apply i bound | i
$(apply (i+1) bound) f (f i x) |] | otherwise = [| \f x -> x |] {-# LANGUAGE CPP #-} {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -DN=8 -ddump-splices #-} module Main(main) where import Apply main = print $ loopW 1 (10^9) body 0
{-# INLINE loopW #-} loopW :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int loopW i max body acc = loop i acc where loop :: Int -> Int -> Int loop !i !acc | i+N<=max = loop (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc) {- loop !i !acc | i+8<=max = loop (i+8) ( body (i+7) $ body (i+6) $ body (i+5) $ body (i+4) $ body (i+3) $ body (i+2) $ body (i+1) $ body i acc) -} loop !i !acc | i<=max = loop (i+1) (body i acc) | otherwise = acc
body :: Int -> Int -> Int body !i !acc = i+acc
Great thinking! This is EXTREMELY COOL! Main.hs:15:42-57: Splicing expression let apply = apply $dOrd = GHC.Base.$f1 $dNum = GHC.Num.$f6 $dLift = Language.Haskell.TH.Syntax.$f18 in apply (0 :: Int) 8 ======> \ f[a1KU] x[a1KV] -> \ f[a1KW] x[a1KX] -> \ f[a1KY] x[a1KZ] -> \ f[a1L0] x[a1L1] -> \ f[a1L2] x[a1L3] -> \ f[a1L4] x[a1L5] -> \ f[a1L6] x[a1L7] -> \ f[a1L8] x[a1L9] -> \ f[a1La] x[a1Lb] -> x[a1Lb] f[a1L8] (f[a1L8] 7 x[a1L9]) f[a1L6] (f[a1L6] 6 x[a1L7]) f[a1L4] (f[a1L4] 5 x[a1L5]) f[a1L2] (f[a1L2] 4 x[a1L3]) f[a1L0] (f[a1L0] 3 x[a1L1]) f[a1KY] (f[a1KY] 2 x[a1KZ]) f[a1KW] (f[a1KW] 1 x[a1KX]) f[a1KU] (f[a1KU] 0 x[a1KV]) In the second argument of `loop', namely `($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)' In the expression: loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) In the definition of `loop': loop !i !acc | i + 8 <= max = loop (i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc) So, that's the fastest yet: $ time ./Main 500000000500000000 ./Main 0.61s user 0.00s system 98% cpu 0.623 total And within 2x the best GCC was doing, gcc -O3 -funroll-loops 0.318 If we unroll even further... $ ghc -O2 -fvia-C -optc-O3 -D64 Main.hs $ time ./Main 500000000500000000 ./Main 0.08s user 0.00s system 94% cpu 0.088 total Very very nice, Claus! Now I'm wondering if we can do this via rewrite rules.... -- Don

Hello Peter, Friday, February 20, 2009, 6:18:50 PM, you wrote:
So GHC is about 3 to 4 times slower as Visual C++ / GCC without loop unrolling
why stop on disabling loop unrolling? there are lot of options we can use if we want to make gcc slower :D -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Achim, Friday, February 20, 2009, 5:44:44 PM, you wrote:
Nice! Now we know that gcc can calculate faster than Haskell can calculate and print. Next time, use exitWith, please.
it was done in order to simplify sources. are you really believe that ghc needs more than 1 millisecond to print one number? :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
Hello Achim,
Friday, February 20, 2009, 5:44:44 PM, you wrote:
Nice! Now we know that gcc can calculate faster than Haskell can calculate and print. Next time, use exitWith, please.
it was done in order to simplify sources. are you really believe that ghc needs more than 1 millisecond to print one number? :)
Well, I know that (Show a) is about as slow as you can get. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Hello Achim, Friday, February 20, 2009, 6:25:31 PM, you wrote:
it was done in order to simplify sources. are you really believe that ghc needs more than 1 millisecond to print one number? :)
Well, I know that (Show a) is about as slow as you can get.
yes, but it's printed only once against 10^9 computations -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On a G4: s.hs (which does not need bang patterns) is:
main = seq (sum0 (10^9) 0) (return ())
sum0 :: Int -> Int -> Int sum0 0 acc = acc sum0 x acc = sum0 (x-1) $! (acc+x)
And s.c is (actually including 10^9, which Bulat's did not):
main() { int sum=0; for(int i=1000*1000*1000; i>0; i--) sum += i; }
I compiled them with ghc --make -O2 s.hs -o shs gcc -o sc -std=c99 -O3 -funroll-loops s.c And timed them: $ time ./shs real 0m3.309s user 0m3.008s sys 0m0.026s $ time ./sc real 0m0.411s user 0m0.316s sys 0m0.006s So C is 9.4 times faster. And via-C did not help: $ ghc -fvia-C -optc "-O3 -funroll-loops" --make -O2 s.hs -o shs-via-C $ time ./shs-via-C real 0m7.051s user 0m3.010s sys 0m0.050s -- Chris

bulat.ziganshin:
Hello haskell-cafe,
since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell:
Wonderful. Thank you!
main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold (which doesn't fuse under build/foldr). You should use the list implementation from the stream-fusion package (or uvector) if you're expecting it to fuse to the following loop:
main = print $ sum0 (10^9) 0
sum0 :: Int -> Int -> Int sum0 0 !acc = acc sum0 !x !acc = sum0 (x-1) (acc+x)
Note the bang patterns aren't required here. It compiles to the following core: $wsum0 :: Int# -> Int# -> Int# $wsum0 = \ (ww_sON :: Int#) (ww1_sOR :: Int#) -> case ww_sON of ds_XD0 { _ -> $wsum0 (-# ds_XD0 1) (+# ww1_sOR ds_XD0); 0 -> ww1_sOR which is perfect. Main_zdwsum0_info: testq %rsi, %rsi movq %rsi, %rax jne .L2 movq %rdi, %rbx jmp *(%rbp) .L2: leaq -1(%rsi), %rsi addq %rax, %rdi jmp Main_zdwsum0_info Which seems ... OK. $ ghc-core A.hs -fvia-C -optc-O3 $ time ./A 500000000500000000 ./A 1.12s user 0.00s system 99% cpu 1.127 total Works for me. That's on linux x86_64, gcc 4.4 Trying -fasm: Main_zdwsum0_info: .LcQs: movq %rsi,%rax testq %rax,%rax jne .LcQw movq %rdi,%rbx jmp *(%rbp) .LcQw: movq %rdi,%rcx addq %rax,%rcx leaq -1(%rax),%rsi movq %rcx,%rdi jmp Main_zdwsum0_info $ time ./A 500000000500000000 ./A 1.65s user 0.00s system 98% cpu 1.677 total Is a bit slower.
main() { int sum=0; //for(int j=0; j<100;j++) for(int i=0; i<1000*1000*1000;i++) sum += i; return sum; }
Well, that's a bit different. It doesn't print the result, and it returns a different results on 64 bit.... $ gcc -O0 t.c $ time ./a.out -1243309312 ./a.out 3.99s user 0.00s system 88% cpu 4.500 total $ gcc -O1 t.c $ time ./a.out -1243309312 ./a.out 0.88s user 0.00s system 99% cpu 0.892 total $ gcc -O3 -funroll-loops t.c $ time ./a.out -1243309312 ./a.out 0.31s user 0.00s system 97% cpu 0.318 total I don't get anything near the 0.062s which is interesting. The print statement slows things down, I guess... So we have: ghc -fvia-C -O2 1.127 ghc -fasm 1.677 gcc -O0 4.500 gcc -O3 -funroll-loops 0.318 So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as bad as it used to be). That's actually a worse margin than any current shootout program, where we are no worse than 2.9 slower on larger things: http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=ghc&lang2=gcc&box=1
execution times: sum: ghc 6.6.1 -O2 : 12.433 secs ghc 6.10.1 -O2 : 12.792 secs sum-fast: ghc 6.6.1 -O2 : 1.919 secs ghc 6.10.1 -O2 : 1.856 secs ghc 6.10.1 -O2 -fvia-C : 1.966 secs C++: gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
I couldn't reproduce your final number. Now, given GHC gets most of the way there -- I think this might make a good bug report against GHC head, so we can see if the new register allocator helps any. http://hackage.haskell.org/trac/ghc/newticket?type=bug Thanks for the report, Bulat! -- Don

Hello Don, Friday, February 20, 2009, 7:41:33 PM, you wrote:
main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold (which doesn't fuse under build/foldr).
You should use the list implementation from the stream-fusion package (or uvector) if you're expecting it to fuse to the following loop:
it was comparison of native haskell, low-level haskell (which is harder to write than native C) and native C. stream-fusion and any other packages provides libraries for some tasks but they can't make faster maps, for example. so i used plain list
Which seems ... OK.
really? :D
Well, that's a bit different. It doesn't print the result, and it returns a different results on 64 bit....
doesn't matter for testing speed
I don't get anything near the 0.062s which is interesting.
it was beautiful gcc optimization - it added 8 values at once. with xor results are: xor.hs 12.605 xor-fast.hs 1.856 xor.cpp 0.339
The print statement slows things down, I guess...
are you really believe that printing one number needs so much time? :)
So we have:
ghc -fvia-C -O2 1.127 ghc -fasm 1.677 gcc -O0 4.500 gcc -O3 -funroll-loops 0.318
why not compare to ghc -O0? also you can disable loop unrolling in gcc and unroll loops manually in haskell. or you can generate asm code on the fly. there are plenty of tricks to "prove" that gcc generates bad code :D
So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as bad as it used to be).
really? what i see: low-level haskell code is usually 3 times harder to write and 3 times slower than gcc code. native haskell code is tens to thousands times slower than C code (just recall that real programs use type classes and monads in addition to laziness)
That's actually a worse margin than any current shootout program, where we are no worse than 2.9 slower on larger things:
1) most benchmarks there depend on libraries speed. in one test, for example, php is winner 2) for the sum program ghc libs was modified to win in benchmark 3) the remaining 1 or 2 programs that measure speed of ghc-generated code was hardly optimized using low-level code, so they don't have anything common with real haskell code most of us write every day
Now, given GHC gets most of the way there -- I think this might make a good bug report against GHC head, so we can see if the new register allocator helps any.
you mean that 6.11 includes new allocator? in that case you can test it too i believe that ghc developers are able to test sum performance without my bugreports :D -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 18:10, Bulat Ziganshin wrote:
Well, that's a bit different. It doesn't print the result, and it returns a different results on 64 bit....
doesn't matter for testing speed
okay then, I wrote a faster Haskell version: main = print "Hello world" Remember – being the same program doesn't matter when all you're testing is speed. Bob

Hello Thomas, Friday, February 20, 2009, 8:22:33 PM, you wrote:
doesn't matter for testing speed
okay then, I wrote a faster Haskell version:
main = print "Hello world"
for you, any language will be fast enough :D -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 18:37, Bulat Ziganshin wrote:
Hello Thomas,
Friday, February 20, 2009, 8:22:33 PM, you wrote:
doesn't matter for testing speed
okay then, I wrote a faster Haskell version:
main = print "Hello world"
for you, any language will be fast enough :D
No C was very slow, I've been waiting for this implementation to terminate for over an hour now: int main (int argc, char ** argv) { while (1) { } printf ("1234567890"); }

bulat.ziganshin:
Friday, February 20, 2009, 7:41:33 PM, you wrote:
main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold (which doesn't fuse under build/foldr).
You should use the list implementation from the stream-fusion package (or uvector) if you're expecting it to fuse to the following loop:
it was comparison of native haskell, low-level haskell (which is harder to write than native C) and native C. stream-fusion and any other packages provides libraries for some tasks but they can't make faster maps, for example. so i used plain list
Hmm? Maybe you're not familiar with the state of the art? $ cabal install uvector Write a loop at a high level: import Data.Array.Vector main = print (sumU (enumFromToU 1 (10^9 :: Int))) Compile it: $ ghc-core A.hs -O2 -fvia-C -optc-O3 Yielding: s16h_info: cmpq 6(%rbx), %rdi jg .L2 addq %rdi, %rsi leaq 1(%rdi), %rdi jmp s16h_info Running: $ time ./A 500000000500000000 ./A 0.97s user 0.01s system 99% cpu 0.982 total Now, (trying to avoid the baiting...) this is actually *very* interesting. Why is this faster than the manual recursion we did earlier why do we get better assembly? Again, if you stick to specifics, there's some interesting things we can learn here.
Which seems ... OK.
really? :D
No, see above.
I don't get anything near the 0.062s which is interesting.
it was beautiful gcc optimization - it added 8 values at once. with xor results are:
xor.hs 12.605 xor-fast.hs 1.856 xor.cpp 0.339
GCC is a good loop optimiser. But apparently not my GCC.
So we have:
ghc -fvia-C -O2 1.127 ghc -fasm 1.677 gcc -O0 4.500 gcc -O3 -funroll-loops 0.318
why not compare to ghc -O0? also you can disable loop unrolling in gcc and unroll loops manually in haskell. or you can generate asm code on the fly. there are plenty of tricks to "prove" that gcc generates bad code :D
No, we want to show (I imagine) that GHC is within a factor or two of "C". I usually set my benchmark to beat gcc -O0 fwiw, and then to hope to be within 2x of optimised C. I'm not sure what you're standards are.
So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as bad as it used to be).
really? what i see: low-level haskell code is usually 3 times harder to write and 3 times slower than gcc code. native haskell code is tens to thousands times slower than C code (just recall that real programs use type classes and monads in addition to laziness)
"thousands times", now you're just undermining your own credibility here. Stick to what you can measure. If anything we'd expect GCC's magic loop skillz to be less useful on large code bases.
That's actually a worse margin than any current shootout program, where we are no worse than 2.9 slower on larger things:
1) most benchmarks there depend on libraries speed. in one test, for example, php is winner 2) for the sum program ghc libs was modified to win in benchmark
It is interesting that the < 2.9x slower in the shootout is pretty much what we found in this benchmark too.
3) the remaining 1 or 2 programs that measure speed of ghc-generated code was hardly optimized using low-level code, so they don't have anything common with real haskell code most of us write every day
Depends on where you work.
Now, given GHC gets most of the way there -- I think this might make a good bug report against GHC head, so we can see if the new register allocator helps any.
you mean that 6.11 includes new allocator? in that case you can test it too
Yes. http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/IntegratedCodeG...
i believe that ghc developers are able to test sum performance without my bugreports :D
No! This is not how open source works! You *should submit bug reports* and *analysis*. It is so so much more useful than complaining and throwing stones. -- Don

Don Stewart
No! This is not how open source works! You *should submit bug reports* and *analysis*. It is so so much more useful than complaining and throwing stones.
Exactly. I don't know where, but I read that the vast majorities of Linux bugs are reported, nailed, and then fixed, by at least three different persons: The first reports a misbehaviour, the second manages to find it surfacing in a certain line of code, the third instantly knows how to make it go away. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

barsoap:
Don Stewart
wrote: No! This is not how open source works! You *should submit bug reports* and *analysis*. It is so so much more useful than complaining and throwing stones.
Exactly. I don't know where, but I read that the vast majorities of Linux bugs are reported, nailed, and then fixed, by at least three different persons: The first reports a misbehaviour, the second manages to find it surfacing in a certain line of code, the third instantly knows how to make it go away.
Elaboarting further: Thinking more about Bulat's code gen observations, I think there's something wrong here -- other than that GHC needs the new codegen to do any of the fancier loop optimisations. If we take what I usually see as the best loops GHC can do for this kind of thing: import Data.Array.Vector main = print (sumU (enumFromToU 1 (10^9 :: Int))) And compile it: $ ghc-core A.hs -O2 -fvia-C -optc-O3 We get ideal core, all data structures fused away, and no heap allocation: $wfold_s15t :: Int# -> Int# -> Int# $wfold_s15t = \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) -> case ># ww2_s154 ww_s14U of wild_aWm { False -> $wfold_s15t (+# ww1_s150 ww2_s154) (+# ww2_s154 1); True -> ww1_s150 }; } in case $wfold_s15t 0 1 Which produces nice assembly: s16e_info: cmpq 6(%rbx), %rdi jg .L2 addq %rdi, %rsi leaq 1(%rdi), %rdi jmp s16e_info This is the best GHC will do here, in my experience, and I'm satisifed with it. Short of new backend tweaks, and realising that GHC is not the loop magic compiler GCC is. http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/IntegratedCodeG... We can be happy with this. The compiler is doing exactly what we expect. $ time ./B 500000000500000000 ./B 0.96s user 0.00s system 99% cpu 0.967 total Now, going back to the low level version, Bulat's loop: main() { int sum=0; //for(int j=0; j<100;j++) for(int i=0; i<1000*1000*1000;i++) sum += i; return sum; } What was first confusing for me was that he wrote the loop "backwards" when translating to Haskell, like this: main = print $ sum0 (10^9) 0 sum0 :: Int -> Int -> Int sum0 0 !acc = acc sum0 !x !acc = sum0 (x-1) (acc+x) (The bang patterns aren't needed). Note how he counts backwards from 10^9. Was there a reason for that, Bulat? I wondered if we just got worse code on backwards counting loops. So translating into the "obvious" translation, counting up: main = print (sum0 0 1) sum0 :: Int -> Int -> Int sum0 acc n | n > 10^9 = acc | otherwise = sum0 (acc + n) (n + 1) Which I actually consider to be the same difficulty as writing the C version, fwiw... We start to notice something interesting: $wsum0 :: Int# -> Int# -> Int# $wsum0 = \ (ww_sOH :: Int#) (ww1_sOL :: Int#) -> case lvl2 of wild1_aHn { I# y_aHp -> case ># ww1_sOL y_aHp of wild_B1 { False -> letrec { $wsum01_XPd :: Int# -> Int# -> Int# $wsum01_XPd = \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) -> case ># ww3_XP9 y_aHp of wild11_Xs { False -> $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1); True -> ww2_XP4 }; } in $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1); True -> ww_sOH } Why is there an extra test? What is GHC doing? Checking the asm: $ ghc -O2 -fasm sQ3_info: .LcRt: cmpq 8(%rbp),%rsi jg .LcRw leaq 1(%rsi),%rax addq %rsi,%rbx movq %rax,%rsi jmp sQ3_info $ time ./B 500000000500000000 ./B 1.30s user 0.01s system 98% cpu 1.328 total So its a fair bit slower. Now, we should, as a principle, be able to write sum directly as I did , and get the same code from the manual, and automatically , fused version. But we didn't. Checking via C: $ ghc -O2 -optc-O3 -fvia-C Better code, but still a bit slower: sQ3_info: cmpq 8(%rbp), %rsi jg .L8 addq %rsi, %rbx leaq 1(%rsi), %rsi jmp sQ3_info Running: $ time ./B 500000000500000000 ./B 1.01s user 0.01s system 97% cpu 1.035 total So I think we have a bug report! Why did GHC put that extra test in place? Now, none of this addresses (I think) Bulat's point that GCC can unroll loops and do other loop magic. That's handled under a different workflow - the new code generator. I'll create the ticket. -- Don

Don Stewart
(The bang patterns aren't needed). Note how he counts backwards from 10^9. Was there a reason for that, Bulat?
Tests against zero are faster, as you don't need a second operand... by now, some platforms might be smart enough, but down-counting in loops is still in my hard-wired optimisation generator, alongside with xor op, op to set op to zero, shifts instead of divides and similar no-brainers. Testing an Integer against 0 is certainly faster than testing it against 2^128^128. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

Don Stewart wrote:
If we take what I usually see as the best loops GHC can do for this kind of thing:
import Data.Array.Vector
main = print (sumU (enumFromToU 1 (10^9 :: Int)))
And compile it:
$ ghc-core A.hs -O2 -fvia-C -optc-O3
We get ideal core, all data structures fused away, and no heap allocation:
$wfold_s15t :: Int# -> Int# -> Int# $wfold_s15t = \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) -> case ># ww2_s154 ww_s14U of wild_aWm { False -> $wfold_s15t (+# ww1_s150 ww2_s154) (+# ww2_s154 1); True -> ww1_s150 }; } in case $wfold_s15t 0 1
Which produces nice assembly:
s16e_info: cmpq 6(%rbx), %rdi jg .L2 addq %rdi, %rsi leaq 1(%rdi), %rdi jmp s16e_info
Note that this does the addition to the accumulator first, and then increments the counter. [snip]
I wondered if we just got worse code on backwards counting loops. So translating into the "obvious" translation, counting up:
main = print (sum0 0 1)
sum0 :: Int -> Int -> Int sum0 acc n | n > 10^9 = acc | otherwise = sum0 (acc + n) (n + 1)
We start to notice something interesting:
$wsum0 :: Int# -> Int# -> Int# $wsum0 = \ (ww_sOH :: Int#) (ww1_sOL :: Int#) -> case lvl2 of wild1_aHn { I# y_aHp -> case ># ww1_sOL y_aHp of wild_B1 { False -> letrec {
$wsum01_XPd :: Int# -> Int# -> Int# $wsum01_XPd = \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) -> case ># ww3_XP9 y_aHp of wild11_Xs { False -> $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1); True -> ww2_XP4 }; } in $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);
True -> ww_sOH }
This is odd, but it doesn't hurt the inner loop, which only involves $wsum01_XPd, and is identical to $wfold_s15t above.
Checking the asm: $ ghc -O2 -fasm
sQ3_info: .LcRt: cmpq 8(%rbp),%rsi jg .LcRw leaq 1(%rsi),%rax addq %rsi,%rbx movq %rax,%rsi jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the (acc + n) addition in this case - this accounts for the extra instruction, because both n+1 and n need to be kept around for the duration of the addq (which does the acc + n addition).
Checking via C:
$ ghc -O2 -optc-O3 -fvia-C
Better code, but still a bit slower:
sQ3_info: cmpq 8(%rbp), %rsi jg .L8 addq %rsi, %rbx leaq 1(%rsi), %rsi jmp sQ3_info
This code is identical (up to renaming registers and one offset that I can't fully explain, but is probably related to a slight difference in handling pointer tags between the two versions of the code) to the "nice assembly" above.
Running:
$ time ./B 500000000500000000 ./B 1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise? If not noise, it may be some alignment effect. Hard to say. Bertram

bertram.felgenhauer:
This is odd, but it doesn't hurt the inner loop, which only involves $wsum01_XPd, and is identical to $wfold_s15t above.
Checking the asm: $ ghc -O2 -fasm
sQ3_info: .LcRt: cmpq 8(%rbp),%rsi jg .LcRw leaq 1(%rsi),%rax addq %rsi,%rbx movq %rax,%rsi jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the (acc + n) addition in this case - this accounts for the extra instruction, because both n+1 and n need to be kept around for the duration of the addq (which does the acc + n addition).
Yep, well spotted.
Checking via C:
$ ghc -O2 -optc-O3 -fvia-C
Better code, but still a bit slower:
sQ3_info: cmpq 8(%rbp), %rsi jg .L8 addq %rsi, %rbx leaq 1(%rsi), %rsi jmp sQ3_info
This code is identical (up to renaming registers and one offset that I can't fully explain, but is probably related to a slight difference in handling pointer tags between the two versions of the code) to the "nice assembly" above.
Indeed, which is gratifying.
Running:
$ time ./B 500000000500000000 ./B 1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise?
If not noise, it may be some alignment effect. Hard to say.
I couldn't get it under 1s from a dozen runs, so assuming some small effect with alignment. Why we get the extra test in the outer loop though, not sure. That's new too I think -- at least I've not seen that pattern before. -- Don

Am Freitag, 20. Februar 2009 18:10 schrieb Bulat Ziganshin:
Hello Don,
Friday, February 20, 2009, 7:41:33 PM, you wrote:
main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold (which doesn't fuse under build/foldr).
You should use the list implementation from the stream-fusion package (or uvector) if you're expecting it to fuse to the following loop:
it was comparison of native haskell, low-level haskell (which is harder to write than native C)
Um, not always, and certainly not in cases like this, at least if you call the simple worker loop "low-level Haskell".
and native C. stream-fusion and any other packages provides libraries for some tasks but they can't make faster maps, for example. so i used plain list
Which is of course comparable with a tight loop in C, isn't it? Really, you hurt your cause by including that. You said you wanted to compare ghc to gcc, then compare what they do to comparable code.
Which seems ... OK.
really? :D
Well, that's a bit different. It doesn't print the result, and it returns a different results on 64 bit....
doesn't matter for testing speed
I don't get anything near the 0.062s which is interesting.
it was beautiful gcc optimization - it added 8 values at once. with xor results are:
xor.hs 12.605 xor-fast.hs 1.856 xor.cpp 0.339
The print statement slows things down, I guess...
are you really believe that printing one number needs so much time? :)
So we have:
ghc -fvia-C -O2 1.127 ghc -fasm 1.677 gcc -O0 4.500 gcc -O3 -funroll-loops 0.318
why not compare to ghc -O0? also you can disable loop unrolling in gcc and unroll loops manually in haskell. or you can generate asm code on the fly. there are plenty of tricks to "prove" that gcc generates bad code :D
That's not what he's doing at all. Sure, he's not comparing Haskell code compiled without optimisations, but he also includes gcc with highest optimisation level. Read the gcc -O0 figure as an indication of what optimisations can achieve.
So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as bad as it used to be).
really? what i see: low-level haskell code is usually 3 times harder to write and 3 times slower than gcc code.
I deny that low-level Haskell code is three times harder to write than ordinary C code, at least if we consider the worker/wrapper idiom low-level Haskell. It is also my experience that gcc usually creates faster executables from good C code than ghc does from corresponding ordinary Haskell code (not using #-magic), but the margin does vary wildly.
native haskell code is tens to thousands times slower than C code (just recall that real programs use type classes and monads in addition to laziness)
Okay, tens is realistic, but thousands? Of course if you compare a tight loop that doesn't allocate to creating thousands of millions of cons-cells... Just because lists are easier to use in Haskell than in any other language I know doesn't mean it's necessary to use lists when writing Haskell if other ways are more fitting for the goal. Just for the record, timings on my machine, gcc-3.3 vs. ghc-6.8.3: $ ./runtests Sums in C, first counting up, then down with -O0 -243309312 real 0m6.751s user 0m6.660s sys 0m0.020s -243309312 real 0m6.318s user 0m6.190s sys 0m0.000s with -O1 -243309312 real 0m2.533s user 0m2.530s sys 0m0.010s -243309312 real 0m1.744s user 0m1.700s sys 0m0.000s with -O2 -243309312 real 0m1.744s user 0m1.710s sys 0m0.000s -243309312 real 0m1.687s user 0m1.680s sys 0m0.000s with -O3 -243309312 real 0m1.753s user 0m1.720s sys 0m0.000s -243309312 real 0m1.701s user 0m1.700s sys 0m0.000s Sums in Haskell First compiled with -O2, then with -O2 -fvia-C -optc-O3 Using uvector -243309312 real 0m7.412s user 0m7.290s sys 0m0.000s -243309312 real 0m5.726s user 0m5.650s sys 0m0.000s Loop down with BangPatterns -243309312 real 0m4.789s user 0m4.750s sys 0m0.010s -243309312 real 0m4.561s user 0m4.470s sys 0m0.000s Loop down without BangPatterns -243309312 real 0m5.092s user 0m4.890s sys 0m0.000s -243309312 real 0m4.747s user 0m4.540s sys 0m0.010s Loop up (with BangPatterns) -243309312 real 0m5.511s user 0m5.320s sys 0m0.000s -243309312 real 0m4.449s user 0m4.410s sys 0m0.000s Using strict left fold -243309312 real 2m45.625s user 2m41.930s sys 0m0.260s -243309312 real 2m43.890s user 2m41.550s sys 0m0.280s Fully naive -243309312 real 2m45.657s user 2m42.980s sys 0m0.250s -243309312 real 2m42.403s user 2m40.160s sys 0m0.370s Done

Don't forget jhc: on my machine (with 'print' equivalent added to C one to be fair, and 10^9 changed to 1000*1000*1000 just like the C one) ghc: (-O2) time ./foo ./foo 2.26s user 0.00s system 99% cpu 2.273 total gcc: time ./a.out ./a.out 0.34s user 0.00s system 99% cpu 0.341 total jhc: time ./hs.out ./hs.out 0.33s user 0.00s system 96% cpu 0.347 total Yay! it is good to see my goal of C-equivalent performance starting to come true :) John -- John Meacham - ⑆repetae.net⑆john⑈

Hello John, Saturday, February 21, 2009, 1:33:12 AM, you wrote:
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20 Feb 2009, at 23:44, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 1:33:12 AM, you wrote:
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need – use it. Bob

Hello Thomas, Saturday, February 21, 2009, 1:52:27 AM, you wrote:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need √ use it.
i don't need jhc speed, i just warn people that believes Don tales. to the best of my knowledge, jhc is not very practical at the moment, but it's better to ask John about current state-of-the-art -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 21 Feb 2009, at 00:01, Bulat Ziganshin wrote:
Hello Thomas,
Saturday, February 21, 2009, 1:52:27 AM, you wrote:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need √ use it.
i don't need jhc speed, i just warn people that believes Don tales.
Oh, okay then... if you don't need the speed, stop complaining. Bye. Bob

On Fri, Feb 20, 2009 at 11:52:27PM +0100, Thomas Davie wrote:
On 20 Feb 2009, at 23:44, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 1:33:12 AM, you wrote:
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need – use it.
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :) John -- John Meacham - ⑆repetae.net⑆john⑈

Hello John, Saturday, February 21, 2009, 2:14:25 AM, you wrote:
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :)
what is "substantial size"? can jhc be used for video codec, i.e. probably no extensions - just raw computations, and thousands or tens of thousands LOCs? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, Feb 21, 2009 at 02:24:59AM +0300, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 2:14:25 AM, you wrote:
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :)
what is "substantial size"? can jhc be used for video codec, i.e. probably no extensions - just raw computations, and thousands or tens of thousands LOCs?
Perhaps. A bigger issue in practice is that the larger a program is, the more likely it is to depend on some library that depends on a ghc extension. However, base is almost 10000 lines and jhc can compile that into a library without too much effort nowadays, so it might scale. If you try and find it fails, then please submit a bug report to jhc@haskell.org. Too many bugs go unreported I find. If the haskell code has an interface that is strict and unboxable (i.e. only unboxable values passed, such as a video codec passing floats might be) then compiling it with jhc and foreign exporting the functions then foreign importing them into ghc for the bulk of the program would probably work. Probably not worth the effort, but could be an interesting experiment. JOhn -- John Meacham - ⑆repetae.net⑆john⑈

John,
please update the section "All is not well in jhc-land" because now
things are better isn´t?
2009/2/21 John Meacham
On Sat, Feb 21, 2009 at 02:24:59AM +0300, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 2:14:25 AM, you wrote:
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :)
what is "substantial size"? can jhc be used for video codec, i.e. probably no extensions - just raw computations, and thousands or tens of thousands LOCs?
Perhaps. A bigger issue in practice is that the larger a program is, the more likely it is to depend on some library that depends on a ghc extension. However, base is almost 10000 lines and jhc can compile that into a library without too much effort nowadays, so it might scale. If you try and find it fails, then please submit a bug report to jhc@haskell.org. Too many bugs go unreported I find.
If the haskell code has an interface that is strict and unboxable (i.e. only unboxable values passed, such as a video codec passing floats might be) then compiling it with jhc and foreign exporting the functions then foreign importing them into ghc for the bulk of the program would probably work. Probably not worth the effort, but could be an interesting experiment.
JOhn
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Sat, Feb 21, 2009 at 01:20:14AM +0100, Alberto G. Corona wrote:
John, please update the section "All is not well in jhc-land" because now things are better isn´t?
Ah, are you refering to this page? http://repetae.net/computer/jhc/jhc.shtml That is just there for historical reasons as my initial announcement. more up to date info is in the manual: http://repetae.net/computer/jhc/manual.html the becoming a developer page: http://repetae.net/computer/jhc/development.shtml and the how do i just try it out page: http://repetae.net/computer/jhc/building.shtml I guess it isn't clear that that original document is no longer up to date. I will put a big warning on it. John -- John Meacham - ⑆repetae.net⑆john⑈

But it is very misleading. It would be nice to have a log or something similar to inform about the current state ://repetae.net/computer/jhc/jhc.shtml
That is just there for historical reasons as my initial announcement.
more up to date info is
in the manual: http://repetae.net/computer/jhc/manual.html the becoming a developer page: http://repetae.net/computer/jhc/development.shtml and the how do i just try it out page: http://repetae.net/computer/jhc/building.shtml
I guess it isn't clear that that original document is no longer up to date. I will put a big warning on it.
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hello John, Saturday, February 21, 2009, 2:49:25 AM, you wrote:
what is "substantial size"? can jhc be used for video codec, i.e. probably no extensions - just raw computations, and thousands or tens of thousands LOCs?
Perhaps. A bigger issue in practice is that the larger a program is, the more likely it is to depend on some library that depends on a ghc extension.
this is true for *application* code, but for codec you may have lots of code that just compute, compute, compute -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, Feb 21, 2009 at 03:21:03AM +0300, Bulat Ziganshin wrote:
what is "substantial size"? can jhc be used for video codec, i.e. probably no extensions - just raw computations, and thousands or tens of thousands LOCs?
Perhaps. A bigger issue in practice is that the larger a program is, the more likely it is to depend on some library that depends on a ghc extension.
this is true for *application* code, but for codec you may have lots of code that just compute, compute, compute
Yes indeed. If there is code like this out there for haskell, I would love to add it as a test case for jhc. I don't see a reason it wouldn't compile to be as fast as C, with the caveat that the strictness analyzer needs to be able to find all the unboxables. It sometimes needs some help with well placed 'seq' statements, but that is true of ghc as well. jhc does suffer a lot more than ghc though when it can't make things strict. John -- John Meacham - ⑆repetae.net⑆john⑈

Hello John, Saturday, February 21, 2009, 3:42:24 AM, you wrote:
this is true for *application* code, but for codec you may have lots of code that just compute, compute, compute
Yes indeed. If there is code like this out there for haskell, I would love to add it as a test case for jhc.
Crypto library has a lot of native haskell code computing hashes and encrypting data hopefully people will show other examples btw, Galois Cryptol has haskell backend, are you know? with jhс compilation it can probably generate as fast code as C backend does. it will be very interesting for us and even look as something close to production usage. i have crossposted message to Don
I don't see a reason it wouldn't compile to be as fast as C, with the caveat that the strictness analyzer needs to be able to find all the unboxables.
there is one problem with haskell - it doesn't support variables and complex control structures. this means that sometimes you need to wrote more complex code to handle situation and as a result, it may be slower than native C -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello John,
Saturday, February 21, 2009, 3:42:24 AM, you wrote:
this is true for *application* code, but for codec you may have lots of code that just compute, compute, compute
Yes indeed. If there is code like this out there for haskell, I would love to add it as a test case for jhc.
Crypto library has a lot of native haskell code computing hashes and encrypting data
hopefully people will show other examples
btw, Galois Cryptol has haskell backend, are you know? with jhс compilation it can probably generate as fast code as C backend does. it will be very interesting for us and even look as something close to production usage. i have crossposted message to Don
That's a very interesting idea. The output from Cryptol is self contained enough, and simple, numerical code, that JHC probably could handle it -- it doesn't require extensive libraries or runtime support, for example. This warrents investigation. Thanks for the suggestion! -- Don

Is there any conceivable factoring of GHC that would allow you to sandwich the core of jhc in between the front and back ends of GHC? Or are the architectures so fundamentally incompatible as to make this impossible? Such a factoring would be one way the community could help, and if successful, it would allow you to better focus on the really important stuff. Regards, John A. De Goes N-BRAIN, Inc. The Evolution of Collaboration http://www.n-brain.net | 877-376-2724 x 101 On Feb 20, 2009, at 4:14 PM, John Meacham wrote:
On Fri, Feb 20, 2009 at 11:52:27PM +0100, Thomas Davie wrote:
On 20 Feb 2009, at 23:44, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 1:33:12 AM, you wrote:
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need – use it.
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :)
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Would it be possible to separate the frontend (Haskell to Core) and backend
(Core to machine code) from the Haskell compilers (requiring a standard Core
language?)
I'm not sure how many extensions required a change to the Core language.
Most likely this is nice in theory but hard in practice?
On Sun, Feb 22, 2009 at 3:25 PM, John A. De Goes
Is there any conceivable factoring of GHC that would allow you to sandwich the core of jhc in between the front and back ends of GHC? Or are the architectures so fundamentally incompatible as to make this impossible?
Such a factoring would be one way the community could help, and if successful, it would allow you to better focus on the really important stuff.
Regards,
John A. De Goes N-BRAIN, Inc. The Evolution of Collaboration
http://www.n-brain.net | 877-376-2724 x 101
On Feb 20, 2009, at 4:14 PM, John Meacham wrote:
On Fri, Feb 20, 2009 at 11:52:27PM +0100, Thomas Davie wrote:
On 20 Feb 2009, at 23:44, Bulat Ziganshin wrote:
Hello John,
Saturday, February 21, 2009, 1:33:12 AM, you wrote:
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Why not? There's nothing stopping you from choosing any Haskell compiler you like. If jhc gives you the performance you need – use it.
Heh. He probably meant something more like "jhc is not a production compiler" which is true for a lot of projects. For projects of substantial size or that require many extensions, jhc falls somewhat short. It is getting better though. Of course, help is always appreciated. :)
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Sun, Feb 22, 2009 at 03:36:34PM +0100, Peter Verswyvelen wrote:
Would it be possible to separate the frontend (Haskell to Core) and backend (Core to machine code) from the Haskell compilers (requiring a standard Core language?) I'm not sure how many extensions required a change to the Core language.
Well, it depends on what you mean by 'core'. If you mean a desugared version of haskell, I think such a front end could be quite useful. in particular, I'd like to see a standalone implementation of template haskell. If you mean something lower level, as in the ghc core intermediate language the compiler uses internally, or jhc's core or grin representation, things get a bit more tricky. Although many core languages are somewhat similar, based on a typed lambda calculus of some sort, the details will differ, and translating between them can be lossy. For instance, looking at jhc core: http://repetae.net/computer/jhc/manual.html#jhc-core-type-system you can see it has a very rich language for dealing with strictness and boxedness. For instances, a boxed value known to be in WHNF actually has a different _type_ than one that is possibly unevaluated. Such distinctions are quite useful for jhc's back end but not so much for ghc's, hence ghc core doesn't make that distinction and any translation between the two would 'lose' that useful information. In other cases things are even worse, for instance ghc has a powerful type equality concept in its core language which jhc has no counterpart for, so that information will be lost on translation. But losing that information will actually cause the core to not type check, since ghc core can type some things jhc core cannot (and vice versa) so coercions or other mechanisms to bypass the type system will have to be introduced. So, it is certainly possible to translate between the two, in fact, I made a jhc core -> ghc core translator, but the code it produced was necessarily riddled with unsafeCoerce#'s for everywhere the type systems didn't quite match up. John -- John Meacham - ⑆repetae.net⑆john⑈

On Sun, Feb 22, 2009 at 8:15 AM, John Meacham
On Sun, Feb 22, 2009 at 03:36:34PM +0100, Peter Verswyvelen wrote:
Would it be possible to separate the frontend (Haskell to Core) and backend (Core to machine code) from the Haskell compilers (requiring a standard Core language?) I'm not sure how many extensions required a change to the Core language.
Well, it depends on what you mean by 'core'. If you mean a desugared version of haskell, I think such a front end could be quite useful.
By the way, coming up pretty soon, I will need a desugared *annotated*Haskell for Dana. If anybody has something like this in the works, I'd love to help with it. If it does not exist by the time I need it, I will make it, so if anyone is interested in working on it with me, let me know :-) Luke
in particular, I'd like to see a standalone implementation of template haskell. If you mean something lower level, as in the ghc core intermediate language the compiler uses internally, or jhc's core or grin representation, things get a bit more tricky.
Although many core languages are somewhat similar, based on a typed lambda calculus of some sort, the details will differ, and translating between them can be lossy.
For instance, looking at jhc core: http://repetae.net/computer/jhc/manual.html#jhc-core-type-system you can see it has a very rich language for dealing with strictness and boxedness. For instances, a boxed value known to be in WHNF actually has a different _type_ than one that is possibly unevaluated. Such distinctions are quite useful for jhc's back end but not so much for ghc's, hence ghc core doesn't make that distinction and any translation between the two would 'lose' that useful information.
In other cases things are even worse, for instance ghc has a powerful type equality concept in its core language which jhc has no counterpart for, so that information will be lost on translation. But losing that information will actually cause the core to not type check, since ghc core can type some things jhc core cannot (and vice versa) so coercions or other mechanisms to bypass the type system will have to be introduced.
So, it is certainly possible to translate between the two, in fact, I made a jhc core -> ghc core translator, but the code it produced was necessarily riddled with unsafeCoerce#'s for everywhere the type systems didn't quite match up.
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

2009/2/22 Luke Palmer
On Sun, Feb 22, 2009 at 8:15 AM, John Meacham
wrote: On Sun, Feb 22, 2009 at 03:36:34PM +0100, Peter Verswyvelen wrote:
Would it be possible to separate the frontend (Haskell to Core) and backend (Core to machine code) from the Haskell compilers (requiring a standard Core language?) I'm not sure how many extensions required a change to the Core language.
Well, it depends on what you mean by 'core'. If you mean a desugared version of haskell, I think such a front end could be quite useful.
By the way, coming up pretty soon, I will need a desugared annotated Haskell for Dana. If anybody has something like this in the works, I'd love to help with it. If it does not exist by the time I need it, I will make it, so if anyone is interested in working on it with me, let me know :-)
The ghc-api exposes a type for annotated source: http://haskell.org/ghc/docs/latest/html/libraries/ghc/GHC.html#t%3ATypecheck... Not that i know how to use it.
Luke
in particular, I'd like to see a standalone implementation of template haskell. If you mean something lower level, as in the ghc core intermediate language the compiler uses internally, or jhc's core or grin representation, things get a bit more tricky.
Although many core languages are somewhat similar, based on a typed lambda calculus of some sort, the details will differ, and translating between them can be lossy.
For instance, looking at jhc core: http://repetae.net/computer/jhc/manual.html#jhc-core-type-system you can see it has a very rich language for dealing with strictness and boxedness. For instances, a boxed value known to be in WHNF actually has a different _type_ than one that is possibly unevaluated. Such distinctions are quite useful for jhc's back end but not so much for ghc's, hence ghc core doesn't make that distinction and any translation between the two would 'lose' that useful information.
In other cases things are even worse, for instance ghc has a powerful type equality concept in its core language which jhc has no counterpart for, so that information will be lost on translation. But losing that information will actually cause the core to not type check, since ghc core can type some things jhc core cannot (and vice versa) so coercions or other mechanisms to bypass the type system will have to be introduced.
So, it is certainly possible to translate between the two, in fact, I made a jhc core -> ghc core translator, but the code it produced was necessarily riddled with unsafeCoerce#'s for everywhere the type systems didn't quite match up.
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 23/02/2009, at 2:22 AM, Luke Palmer wrote:
By the way, coming up pretty soon, I will need a desugared annotated Haskell for Dana. If anybody has something like this in the works, I'd love to help with it. If it does not exist by the time I need it, I will make it, so if anyone is interested in working on it with me, let me know :-)
Luke
Hi Luke, Any progress on that front? How much desugaring do you want? What kind of annotations do you want? Can you get what you need from the GHC API? Cheers, Bernie.

On Wed, Feb 25, 2009 at 7:43 PM, Bernie Pope
On 23/02/2009, at 2:22 AM, Luke Palmer wrote:
By the way, coming up pretty soon, I will need a desugared annotated Haskell for Dana. If anybody has something like this in the works, I'd love to help with it. If it does not exist by the time I need it, I will make it, so if anyone is interested in working on it with me, let me know :-)
Luke
Hi Luke,
Any progress on that front?
Not yet. It's still a few items down in the queue.
How much desugaring do you want? What kind of annotations do you want?
Enough desugaring to make Haskell "simple" (yes, I know that's subjective). Probably somewhere around System-F (Fw perhaps, once I learn what that is). The main things I can think of are getting rid of special notation (do, list comp., where clauses, infix operators) and expanding typeclasses to dictionary passing.
Can you get what you need from the GHC API?
I haven't looked into it, but my guess is not. The main requirement is that it (and its desugared target) needs to be really pure (no IO, FFI, or polymorphic seq), and that it must be able to desugar itself. It might provide a good launching point though. Luke

On Sun, Feb 22, 2009 at 07:25:26AM -0700, John A. De Goes wrote:
Is there any conceivable factoring of GHC that would allow you to sandwich the core of jhc in between the front and back ends of GHC? Or are the architectures so fundamentally incompatible as to make this impossible?
Well, it wouldn't really be useful sandwiched between the front and back end of ghc, the main advantages jhc has over ghc's model are in its lower level optimizations closer to the back end. Feeding ghc core into jhc shouldn't be impossible. Oddly enough, I have written a ghc back end for jhc, mainly for testing purposes rather than a serious back end. The tricky part isn't so much the translation of ghc to jhc core, the type system and representation are quite similar, but the difference in the primitives that the two systems expect to exist. for instance, ghc has high level primitives, such as operations on Integers and primitive types that high signedness. While jhcs primitives are much lower level, it has no special concept of Integer for instance, you implement Integer either in pure haskell code or FFI bindings to GMP or some other way, and jhc's primitive numerical types are simply bit patterns with no interpretation, for instance data Int = I Bits32_ data Word = W Bits32_ and the only thing that makes Int signed and Word not is in the 'Num' instances for those types. That said, i don't see any reason a ghc-core front end for jhc wouldn't be possible, an adapter library could be written to provide ghc style primitives on top of the jhc ones. It would certainly be an interesting project. John -- John Meacham - ⑆repetae.net⑆john⑈

I think doing this work would improve the design of GHC by improving modularity and factoring out generalized abstractions. The richest possible core language makes the most sense for a common core, because what's not needed can always be discarded. From your description, it sounds like such a common core language would be a hybrid of jhc and ghc core, because each contains more information for particular subsets of the language. Layering higher-level primitives on lower-level primitives also makes sense, as it's a very flexible approach. How much support and direction can you provide if we get enough people interested in this? Regards, John A. De Goes N-BRAIN, Inc. The Evolution of Collaboration http://www.n-brain.net | 877-376-2724 x 101 On Feb 22, 2009, at 7:45 AM, John Meacham wrote:
On Sun, Feb 22, 2009 at 07:25:26AM -0700, John A. De Goes wrote:
Is there any conceivable factoring of GHC that would allow you to sandwich the core of jhc in between the front and back ends of GHC? Or are the architectures so fundamentally incompatible as to make this impossible?
Well, it wouldn't really be useful sandwiched between the front and back end of ghc, the main advantages jhc has over ghc's model are in its lower level optimizations closer to the back end. Feeding ghc core into jhc shouldn't be impossible. Oddly enough, I have written a ghc back end for jhc, mainly for testing purposes rather than a serious back end.
The tricky part isn't so much the translation of ghc to jhc core, the type system and representation are quite similar, but the difference in the primitives that the two systems expect to exist. for instance, ghc has high level primitives, such as operations on Integers and primitive types that high signedness. While jhcs primitives are much lower level, it has no special concept of Integer for instance, you implement Integer either in pure haskell code or FFI bindings to GMP or some other way, and jhc's primitive numerical types are simply bit patterns with no interpretation, for instance
data Int = I Bits32_ data Word = W Bits32_
and the only thing that makes Int signed and Word not is in the 'Num' instances for those types.
That said, i don't see any reason a ghc-core front end for jhc wouldn't be possible, an adapter library could be written to provide ghc style primitives on top of the jhc ones. It would certainly be an interesting project.
John
-- John Meacham - ⑆repetae.net⑆john⑈ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Bulat Ziganshin
Don't forget jhc:
i was pretty sure that jhc will be as fast as gcc :) unfortunately, jhc isn't our production compiler
Neither is GCC :-) -k -- If I haven't seen further, it is by standing in the footprints of giants

On Friday 20 February 2009 16:29:29 Bulat Ziganshin wrote:
Hello haskell-cafe,
since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell:
main = print $ sum[1..10^9::Int]
... skipped ...
The discussion is mostly about low level optimizations such as loop unrolling etc. I have another question. Why shouldn't compiler realize that `sum [1..10^9]' is constant and thus evaluate it at compile time? -- Khudakov Alexey

Hello Khudyakov, Saturday, February 21, 2009, 2:07:39 AM, you wrote:
I have another question. Why shouldn't compiler realize that `sum [1..10^9]' is constant and thus evaluate it at compile time?
since we expect that compilation will be done in reasonable amount of time. you cannot guarantee this for list-involving computation -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, Feb 21, 2009 at 12:22 AM, Bulat Ziganshin wrote: Hello Khudyakov, Saturday, February 21, 2009, 2:07:39 AM, you wrote: I have another question. Why shouldn't compiler realize that `sum
[1..10^9]'
is constant and thus evaluate it at compile time? since we expect that compilation will be done in reasonable amount of
time. you cannot guarantee this for list-involving computation it would be nice to have a compiler that can run forever, incrementally
generating faster and faster versions of the same program, until you press a
key or a timeout is reached.
then you just let it run before you get to bed ;-)
you could even pass it in a test data set to which it must be optimized;
after the program is compiled, the compiler runs and profiles it, measures
the results, and does another pass to make it faster.
some C++ compilers can already do this (profile based optimization). --
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com _______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Peter Verswyvelen
you could even pass it in a test data set to which it must be optimized; after the program is compiled, the compiler runs and profiles it, measures the results, and does another pass to make it faster.
some C++ compilers can already do this (profile based optimization).
Rumor says firefox needs profile based optimization to run faster. Or it is not a rumor at all. I guess for all those goodness of haskell, a heavy profile based optimization should be done and could probably result in a much faster binary than C++. Best, Xiao-Yong -- c/* __o/* <\ * (__ */\ <

Hello Xiao-Yong, Saturday, February 21, 2009, 3:16:28 AM, you wrote:
some C++ compilers can already do this (profile based optimization).
Rumor says firefox needs profile based optimization to run faster. Or it is not a rumor at all.
why it's rumor? PGO is natural optimization technique, just like loop unrolling or register allocation -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Khudyakov Alexey
On Friday 20 February 2009 16:29:29 Bulat Ziganshin wrote:
Hello haskell-cafe,
since there are no objective tests comparing ghc to gcc, i made my own one. these are 3 programs, calculating sum in c++ and haskell:
main = print $ sum[1..10^9::Int]
... skipped ...
The discussion is mostly about low level optimizations such as loop unrolling etc.
I have another question. Why shouldn't compiler realize that `sum [1..10^9]' is constant and thus evaluate it at compile time?
+1. There's a lot that can be done in this area, even without complete information (though I wouldn't mind Haskell being total and dependently typed...), if you hack around unsafePerformIO. There's a good reason the language shootout passes such things as arguments to the program, but not getting faster there doesn't mean that it doesn't make sense to aggressively compile-time evaluate... or automagically memoise, or do any of those other optimisations gcc can only dream of. On the "why"-question... lacking manpower, and, possibly, line count not having enough weight in the shootout. For now, you can just use TH to force things getting evaluated. (Try to do that in C) -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
participants (30)
-
Achim Schneider
-
Ahn, Ki Yung
-
Alberto G. Corona
-
Andrea Vezzosi
-
Bernie Pope
-
Bertram Felgenhauer
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
ChrisK
-
Claus Reinke
-
Colin Paul Adams
-
Dan Doel
-
Daniel Fischer
-
David Leimbach
-
Don Stewart
-
Felipe Lessa
-
Isaac Gouy
-
John A. De Goes
-
John Meacham
-
Ketil Malde
-
Khudyakov Alexey
-
Luke Palmer
-
Manlio Perillo
-
Miguel Mitrofanov
-
minh thu
-
Peter Verswyvelen
-
Ross Mellgren
-
Sebastian Sylvan
-
Thomas Davie
-
Xiao-Yong Jin