
Many people have written words to the effect that the #haskell IRC channel is just bursting with helpful Haskellers who are endlessly friendly and seemingly able to help solve any problem. Personally, this has never been my experience. (More like there's 300 people idling and nobody ever actually speaks...) I am pleased to report that last night I was on #haskell and I did in fact get lots of useful help, from a number of people. So first of all, thanks for that, guys!

It all relates to this simple program: http://hpaste.org/4438

I pointed out that you can write a complex explicit loop in Java, and that you can translate the same thing into Haskell, and it works. But in Haskell, you can also do it by composing a couple of functions, and it's much easier to read. Somebody else countered that while this is all very cute, it can never be efficient. To which I obviously countered that stream fusion *makes* it efficient. The numbers generated are thus:

- Program with no particular optimisations: 0.35 seconds.
- Program with stream fusion [and GHC HEAD]: 0.25 seconds.
- Program with stream fusion and ByteString: 0.05 seconds.

Surely you'd have to work pretty hard to get that kind of speed even in C. ;-) ...erm, actually no. Somebody sat down and wrote something in five minutes that takes 0.005 seconds. Oops! So it seems even with ByteStrings and stream fusion, we're still 10x slower than naive C. :-(

You can console yourself that maybe 5 milliseconds is too short a time span to reliably measure, and maybe the speed difference is just down to the OS caching the file in RAM or something. Maybe a benchmark of something that takes tens of seconds would be better. All I know is that once again, it seems Haskell isn't as fast as I thought... *sigh*

Well anyway, a few years ago we didn't have fusion, and we didn't have ByteString. A few years ago, the program would have taken 0.35 seconds. The end. Today, with a few import statements and compiler switches, the exact same code takes 0.05 seconds. Tomorrow, who knows? Maybe I'm being overly optimistic, but... ;-)
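For reference, in case the paste expires: a minimal sketch of the kind of program being discussed, assuming the task is "read a file and print its longest word". The file name is a placeholder, and this is not necessarily the exact code from the paste:

    import qualified Data.ByteString.Char8 as B
    import Data.List (maximumBy)
    import Data.Ord (comparing)

    -- Read the input, split it into words, and print the longest one.
    -- With fusible combinators, GHC can collapse much of this pipeline
    -- into a small number of traversals of the buffer.
    main :: IO ()
    main = do
      contents <- B.readFile "input.txt"  -- placeholder file name
      B.putStrLn (maximumBy (comparing B.length) (B.words contents))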

On Dec 14, 2007 12:12 PM, Andrew Coppin
Today, with a few import statements and compiler switches, the exact same code takes 0.05 seconds. Tomorrow, who knows? Maybe I'm being overly optimistic, but... ;-)
There have been some great improvements in array handling recently. I decided to have a look at the assembly language generated by some simple array manipulation code and understand why C is at least twice as fast as ghc 6.8.1. On the one hand it was disappointing to see that the Haskell register allocator seems a bit inept and was loading data into registers that should never have been spilled out of registers in the first place. On the other hand, the code wasn't fundamentally doing anything weird (eg. it wasn't doing graph reductions or anything like that, it looked like the sort of loop you might get from a C compiler). It was a few seconds of fairly mindless work to fix up the assembly language and make it much faster. And if I can do it, it means that the next generation of Haskell compiler should be able to do it too; after all, good freely available methods to allocate registers do exist. So I'm looking forward to the next version of GHC matching C's performance for inner loops of array manipulation code :-) -- Dan

Hello Dan, Friday, December 14, 2007, 11:57:38 PM, you wrote:
to allocate registers do exist. So I'm looking forward to the next version of GHC matching C's performance for inner loops of array manipulation code :-)
with support for loop unrolling, smart register allocation, strength reduction and so on? :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 12/14/07, Bulat Ziganshin
Hello Dan,
Friday, December 14, 2007, 11:57:38 PM, you wrote:
to allocate registers do exist. So I'm looking forward to the next version of GHC matching C's performance for inner loops of array manipulation code :-)
with support for loop unrolling,
GHC calls this "inlining".
smart register allocation,
This is being worked on actively, AFAIK.
strength reduction
Easy to implement (in theory) with GHC rewrite rules, AFAIK (or at least, Simon PJ suggested that that might be so in a mailing list post a few months ago.) Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "People tell me I look down / but I'm always standing sixty-six inches off the ground." -- Carrie Bradley
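To illustrate the idea (a sketch, not code from GHC or this thread): a rewrite rule can express a simple strength reduction, here replacing a multiplication by two with an addition. The function 'double' is hypothetical:

    module StrengthReduce where

    -- A toy strength reduction: rewrite calls to 'double' into an
    -- addition instead of a multiplication during simplification.
    double :: Int -> Int
    double x = x * 2
    {-# NOINLINE double #-}  -- keep calls visible so the rule can fire

    {-# RULES
    "double/strength-reduce" forall x. double x = x + x
      #-}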

Hello Tim, Saturday, December 15, 2007, 7:10:26 AM, you wrote:
with support for loop unrolling,
GHC calls this "inlining".
1. Loop unrolling means generating several iterations of the loop body, so that, say, 100 iterations of *p++=*q++ become 25 iterations of *p++=*q++; *p++=*q++; *p++=*q++; *p++=*q++;

2. Actually, GHC can't inline tail-recursive functions at all (although I haven't checked this since 6.4).

There are also many more optimization tricks. I don't think that a modern compiler with an optimization level comparable to gcc's can be delivered without many man-years of development. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 12/15/07, Bulat Ziganshin
Hello Tim,
Saturday, December 15, 2007, 7:10:26 AM, you wrote:
with support for loop unrolling,
GHC calls this "inlining".
1. Loop unrolling means generating several iterations of the loop body, so that, say, 100 iterations of *p++=*q++ become 25 iterations of *p++=*q++; *p++=*q++; *p++=*q++; *p++=*q++;
I know what loop unrolling means. In a pure functional language, it reduces to inlining, because recursion is used instead of loops, and the inliner can do the job of inlining (a fixed number of) iterations of a recursive function -- I don't know if it does this now, but it would be easy to implement. You don't have to believe me -- read section 4.6 of the inliner paper: http://research.microsoft.com/~simonpj/Papers/inlining/
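To make the correspondence concrete, here is a sketch (with hypothetical names, not code from the paper): a tail-recursive Haskell loop, and the same function after inlining one recursive call into its body -- which is precisely a one-step unrolling.

    -- A tail-recursive loop: sum the integers n, n-1, ..., 1.
    sumTo :: Int -> Int -> Int
    sumTo acc 0 = acc
    sumTo acc n = sumTo (acc + n) (n - 1)

    -- The same loop "unrolled" once, i.e. with one recursive call
    -- inlined into the body. An inliner could produce this (or a
    -- deeper unrolling) mechanically from the definition above.
    sumToUnrolled :: Int -> Int -> Int
    sumToUnrolled acc 0 = acc
    sumToUnrolled acc n = case n - 1 of
      0  -> acc + n
      n' -> sumToUnrolled (acc + n + n') (n' - 1)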
2. Actually, GHC can't inline tail-recursive functions at all (although I haven't checked this since 6.4).
It may be that GHC *doesn't* inline tail-recursive functions, but as I pointed out above (which I'm just getting directly from the paper), it would be easy to flip a switch and let it inline a fixed number of iterations of them.
There are also many more optimization tricks. I don't think that a modern compiler with an optimization level comparable to gcc's can be delivered without many man-years of development.
I think that's an awfully definite statement to make, given that C and Haskell are very different languages, given how many high-level optimizations are possible in Haskell that aren't in C, and given how much higher programmer productivity is in Haskell than C. For example, as above, loop unrolling turns out to be just a special case of inlining. That's not true in C. The simplicity of Haskell (or rather, Core) means it's easy to implement a lot of things with a great deal of generality, an advantage that gcc doesn't have. Or, I mean, feel free to insist things are impossible, but try not to stand in the way of the people who are doing them while you say so. :-) Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "It's easy to consider women more emotional than men when you don't consider rage to be an emotion." -- Brenda Fine

Hello Tim, Saturday, December 15, 2007, 5:35:03 PM, you wrote:
the inliner can do the job of inlining (a fixed number of) iterations of a recursive function -- I don't know if it does this now, but it would be easy to implement.
It may be that GHC *doesn't* inline tail-recursive functions, but as I pointed out above (which I'm just getting directly from the paper), it would be easy
I see your point - it's easy to implement everything in GHC. Probably its authors were sleeping for the last 15 years :)
as above, loop unrolling turns out to be just a special case of inlining.
And GHC was so ingenious that it implemented the general case without implementing the special one :)
That's not true in C. The simplicity of Haskell (or rather, Core) means it's easy to implement a lot of things with a great deal of generality, an advantage that gcc doesn't have.
The Core language has the same complexity for generating good code as C, C-- or LLVM.
Or, I mean, feel free to insist things are impossible, but try not to stand in the way of the people who are doing them while you say so. :-)
You may believe what you want; I prefer to talk about the real situation. If it were possible to quickly write a good Haskell compiler, it would have been written many years ago. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, 15 Dec 2007 21:46:43 +0300
Bulat Ziganshin
You may believe what you want; I prefer to talk about the real situation. If it were possible to quickly write a good Haskell compiler, it would have been written many years ago.
No-one is writing a commercial Haskell compiler yet (although there is at least one commercial Haskell-like language). What I mean is, the amount of "commercial-oriented" funding spent on GHC (as opposed to the "research-oriented" funding spent by Microsoft Research and various research bodies) is, as far as I know, zero. Incentives matter. If there were a commercial Haskell compiler, maybe we would see faster progress. -- Robin

Hi
No-one is writing a commercial Haskell compiler yet (although there is at least one commercial Haskell-like language). What I mean is, the amount of "commercial-oriented" funding spent on GHC (as opposed to the "research-oriented" funding spent by Microsoft Research and various research bodies) is, as far as I know, zero. Incentives matter. If there were a commercial Haskell compiler, maybe we would see faster progress.
It isn't just about money. It's also about ideas, luck and randomly bumping into people. I am sure there is a great strategy for making really fast Haskell compilers, and I am sure at some point we'll figure out what it is. I agree with Bulat that Haskell has, if anything, even better optimisation potential than something like C. With Haskell you can do the crazy high-level optimisations that in a language like C would demand really advanced alias analysis. Compare this to low-level optimisations, which in Haskell require strictness analysis but in C are easy. At some point high-level will become more important than low-level; when it does, we win :-) Thanks Neil

Hello Neil, Saturday, December 15, 2007, 10:07:54 PM, you wrote:
I agree with Bulat that Haskell has, if anything, even better optimisation potential than something like C. With Haskell you can do the crazy high-level optimisations that in a language like C would demand really advanced alias analysis.
I think this ability is already implemented in GHC, but the low-level code generation, at the STG-to-assembly level, is very simple and compares only with 20-year-old C compilers like MSC 5.1. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Neil Mitchell wrote:
Hi
I agree with Bulat that Haskell has, if anything, even better optimisation potential than something like C. With Haskell you can do the crazy high-level optimisations that in a language like C would demand really advanced alias analysis. Compare this to low-level optimisations, which in Haskell require strictness analysis but in C are easy. At some point high-level will become more important than low-level; when it does, we win :-)
I had this conversation where Mr C++ basically said that my code implements 3 loops and it's "not possible" to optimise that into just 1 loop like the C program is doing. Then I (or rather, Don) demonstrated that the stream fusion library *has* optimised it into just 2 loops. (Apparently the library isn't 100% complete as yet.)

I liken transformations like this to the sort of high-level optimisations that a database engine might do given an SQL statement. An SQL SELECT certainly *looks* just like a loop construct. But it isn't; it declares the result, not the algorithm, freeing the database to use *any* algorithm that produces the right answer. The result is that, as is well known, databases are supremely good at executing SQL queries blisteringly fast. When I see the compiler turn 3 loops into 2 by using algebraic properties of the program source code, that's how I think of it.

Hmm, perhaps to really show this off, we need a more complicated program. Something that would be just too hard to implement as 1 loop in C, but is easy for the GHC optimiser to build. I shall meditate on this further...
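For a concrete taste of such an algebraic property (the textbook example, not the stream fusion library's actual rules): the map/map law, stated as a GHC rewrite rule, collapses two loops into one.

    module MapFusion where

    -- The classic map/map fusion law as a rewrite rule: two list
    -- traversals become one. Stream fusion generalises this idea to
    -- whole pipelines of combinators.
    {-# RULES
    "map/map" forall f g xs. map f (map g xs) = map (f . g) xs
      #-}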

andrewcoppin:
Neil Mitchell wrote:
Hi
I agree with Bulat that Haskell has, if anything, even better optimisation potential than something like C. With Haskell you can do the crazy high-level optimisations that in a language like C would demand really advanced alias analysis. Compare this to low-level optimisations, which in Haskell require strictness analysis but in C are easy. At some point high-level will become more important than low-level; when it does, we win :-)
I had this conversation where Mr C++ basically said that my code implements 3 loops and it's "not possible" to optimise that into just 1 loop like the C program is doing. Then I (or rather, Don) demonstrated that the stream fusion library *has* optimised it into just 2 loops. (Apparently the library isn't 100% complete as yet.)
Right, we haven't bothered with streaming versions of 'words'. However, the maximumBy and map happily fuse into a single loop.
Hmm, perhaps to really show this off, we need a more complicated program. Something that would be just too hard to implement as 1 loop in C, but is easy for the GHC optimiser to build. I shall meditate on this further...
Do you have the single loop C program, btw? I'd be curious to see if this is really "feasible". It would have to do the buffering, tokenising and accumulating in one go. I'd imagine it is a bit hairy. And, it should not significantly outperform, say, a bytestring version. If it does, I'd like to see that. -- Don

Hello Don, Saturday, December 15, 2007, 10:57:02 PM, you wrote:
Do you have the single loop C program, btw? I'd be curious to see if this is really "feasible". It would have to do the buffering, tokenising and accumulating in one go. I'd imagine it is a bit hairy.
    for (int n; n = read (0, buf, 32768);) {
        for (char *p = buf, *end = buf + n;;) {
            while (*p++ == ' ')  if (p == end) goto end;
            while (*p++ != ' ')  if (p == end) goto end;
            words++;
        }
        end:;
    }
And, it should not significantly outperform, say, a bytestring version. If it does, I'd like to see that.
you are welcome :)

    OPT_FLAGS = -O3 -march=pentiumpro -mtune=pentiumpro \
                -fomit-frame-pointer -fstrict-aliasing \
                -ffast-math -fforce-addr -funroll-loops \
                -fno-exceptions -fno-rtti

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Saturday, December 15, 2007, 10:57:02 PM, you wrote:
Do you have the single loop C program, btw? I'd be curious to see if this is really "feasible". It would have to do the buffering, tokenising and accumulating in one go. I'd imagine it is a bit hairy.
    for (int n; n = read (0, buf, 32768);) {
        for (char *p = buf, *end = buf + n;;) {
            while (*p++ == ' ')  if (p == end) goto end;
            while (*p++ != ' ')  if (p == end) goto end;
            words++;
        }
        end:;
    }
How many loops is that, Bulat? :) -- Don

dons:
bulat.ziganshin:
Hello Don,
Saturday, December 15, 2007, 10:57:02 PM, you wrote:
Do you have the single loop C program, btw? I'd be curious to see if this is really "feasible". It would have to do the buffering, tokenising and accumulating in one go. I'd imagine it is a bit hairy.
    for (int n; n = read (0, buf, 32768);) {
        for (char *p = buf, *end = buf + n;;) {
            while (*p++ == ' ')  if (p == end) goto end;
            while (*p++ != ' ')  if (p == end) goto end;
            words++;
        }
        end:;
    }
Oh, this isn't the original program, either. You need to find the longest word and print it. Not count the words. -- Don

Hello Don, Saturday, December 15, 2007, 11:28:00 PM, you wrote:
Do you have the single loop C program, btw? I'd be curious to see if
Oh, this isn't the original program, either. You need to find the longest word and print it. Not count the words.
I can't understand what you mean by a single-loop program. My code does one pass through the data without generating any intermediate data structures, and I think it's not worse than the output of stream fusion. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Saturday, December 15, 2007, 11:28:00 PM, you wrote:
Do you have the single loop C program, btw? I'd be curious to see if
Oh, this isn't the original program, either. You need to find the longest word and print it. Not count the words.
I can't understand what you mean by a single-loop program. My code does one pass through the data without generating any intermediate data structures, and I think it's not worse than the output of stream fusion.
Yes, it's very nice! That is the kind of code I'd *hope* to produce from a series of fused loops: a bunch of little loop bodies alternating control. I'm actually more interested in Andrew's mysterious 10x faster C program, which someone provided to him. It solved the original problem of finding the longest word in the dictionary (so slightly more complex). I'd expect the ByteString program to be 2-3x worse in the worst case, so the 10x figure is curious. -- Don

Don Stewart wrote:
andrewcoppin:
(Apparently the library isn't 100% complete as yet.)
Right, we haven't bothered with streaming versions of 'words'. However, the maximumBy and map happily fuse into a single loop.
Indeed. And hopefully when words gets implemented, we will have One Loop to rule them all... er... well you know what I mean.
Hmm, perhaps to really show this off, we need a more complicated program. Something that would be just too hard to implement as 1 loop in C, but is easy for the GHC optimiser to build. I shall meditate on this further...
Do you have the single loop C program, btw? I'd be curious to see if this is really "feasible". It would have to do the buffering, tokenising and accumulating in one go. I'd imagine it is a bit hairy.
And, it should not significantly outperform, say, a bytestring version. If it does, I'd like to see that.
First version:

    n = 0;
    while( n < FILE_SIZE )
    {
        while( n < FILE_SIZE && file[n++] == ' ' );
        wStart = n;
        while( n < FILE_SIZE && file[n++] != ' ' );
        wLength = n - wStart;
        if( wLength > strlen( longestString ) )
            strncpy( longestString , file + wStart , wLength );
    }

Takes 0.016 seconds to process a 2.4 MB file. [But not the same one Don used.]

Then Mr C++ looked at it and said "OMG! You don't *never* use strlen() inside a loop!" and the second version was written:

    file[FILE_SIZE] = ' ';
    n = 0;
    maxLength = 0;
    while( n < FILE_SIZE )
    {
        while( file[n++] == ' ' );
        wStart = n;
        while( file[n++] != ' ' );
        wLength = n - wStart;
        if( wLength > maxLength )
        {
            longestWordStart = wStart;
            maxLength = wLength;
        }
    }
    strncpy( longestString , file + longestWordStart , maxLength );

This version takes 0.005 seconds. I have no idea what kind of hardware this is running on - or even which compiler or OS.

Hello Andrew, Sunday, December 16, 2007, 1:39:02 AM, you wrote:
Takes 0.016 seconds to process a 2.4 MB file. [But not the same one Don used.]
This version takes 0.005 seconds.
Don, it seems to be bound by memory speed rather than by the quality of the generated code. I suggest you test it on a fixed 32 KB block processed many times. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin wrote:
Then Mr C++ looked at it and said "OMG! You don't *never* use strlen() inside a loop!" and the second version was written:
    file[FILE_SIZE] = ' ';
    n = 0;
    maxLength = 0;
    while( n < FILE_SIZE )
    {
        while( file[n++] == ' ' );
        wStart = n;
        while( file[n++] != ' ' );
        wLength = n - wStart;
        if( wLength > maxLength )
        {
            longestWordStart = wStart;
            maxLength = wLength;
        }
    }
    strncpy( longestString , file + longestWordStart , maxLength );
This version takes 0.005 seconds.
Nice. I especially like the way it'll segfault if there is a blank at the end of the file. Roman

Roman Leshchinskiy wrote:
Andrew Coppin wrote:
Then Mr C++ looked at it and said "OMG! You don't *never* use strlen() inside a loop!" and the second version was written:
Nice. I especially like the way it'll segfault if there is a blank at the end of the file.
That's why I love C so much. :-)

Hello Robin, Saturday, December 15, 2007, 9:54:43 PM, you wrote:
You may believe what you want; I prefer to talk about the real situation. If it were possible to quickly write a good Haskell compiler, it would have been written many years ago.
No-one is writing a commercial Haskell compiler yet (although there is at least one commercial Haskell-like language). What I mean is, the amount of "commercial-oriented" funding spent on GHC (as opposed to the "research-oriented" funding spent by Microsoft Research and various research bodies) is, as far as I know, zero. Incentives matter. If there were a commercial Haskell compiler, maybe we would see faster progress.
Yes, that's one of my points. Besides the complexity of generating efficient code for a high-level language, there are just too few resources spent here compared to icc/gcc/msvc. I don't think that a good Haskell compiler (for simple loops) is impossible; I just don't see the chances of getting it in reality. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 12/15/07, Bulat Ziganshin
I see your point - it's easy to implement everything in GHC. Probably its authors were sleeping for the last 15 years :)
As you well know, implementing things in GHC isn't always easy for people who aren't named "Simon", and people who are named "Simon" are often busy not so much with sleeping as with coming up with things that will lead to new papers rather than implementing straightforward things that are already pointed out in existing papers :-) (I know it's dangerous to call optimizations "straightforward" before you try to implement them, but even so.)
And GHC was so ingenious that it implemented the general case without implementing the special one :)
Isn't implementing the general case and leaving the users to use it to implement the special ones what functional programming is about? :-)
That's not true in C. The simplicity of Haskell (or rather, Core) means it's easy to implement a lot of things with a great deal of generality, an advantage that gcc doesn't have.
The Core language has the same complexity for generating good code as C, C-- or LLVM.
Sorry, I don't know what you mean here; I assume by "complexity" you don't mean "time complexity" or "space complexity", and anyway, I don't know what those would mean as applied to a programming language. Care to elaborate?
You may believe what you want; I prefer to talk about the real situation. If it were possible to quickly write a good Haskell compiler, it would have been written many years ago.
As others have pointed out, I think that's false. Resources, financial and human, have been thrown at C compilation that have not been thrown at Haskell compilers. As hard-working as the people who work on Haskell compilers are, there aren't very many of them and none of them have "writing Haskell compilers" as a job description. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "The blues isn't about feeling better, it's about making other people feel worse, and making a few bucks while you're at it." -- Bleeding Gums Murphy

Tim Chevalier wrote:
I think that's an awfully definite statement to make, given that C and Haskell are very different languages, given how many high-level optimizations are possible in Haskell that aren't in C, and given how much higher programmer productivity is in Haskell than C. For example, as above, loop unrolling turns out to be just a special case of inlining. That's not true in C. The simplicity of Haskell (or rather, Core) means it's easy to implement a lot of things with a great deal of generality, an advantage that gcc doesn't have.
While this is true in general, loop optimisations are not a particularly good example, IMO. For them to be effective, you often have to consider things like instruction scheduling and register pressure which you can't really do on the Core level. Roman

On 12/15/07, Roman Leshchinskiy
While this is true in general, loop optimisations are not a particularly good example, IMO. For them to be effective, you often have to consider things like instruction scheduling and register pressure which you can't really do on the Core level.
Fair enough for loop optimizations in general, but I think the point about loop unrolling from the "Secrets of the GHC Inliner" paper that I was referring to gives a counterexample to that. In imperative languages, loop unrolling and inlining would be thought of as distinct and very different optimizations -- in a pure functional language, they can be unified. It seems to me that working in a pure functional language makes it easy to write high-level optimizations that can be specified very simply at a level like Core, and then amplified by a smart backend that takes things like instruction scheduling into account. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Stupidity combined with arrogance and a huge ego will get you a long way." -- Chris Lowe

On 12/14/07, Dan Piponi
There have been some great improvements in array handling recently. I decided to have a look at the assembly language generated by some simple array manipulation code and understand why C is at least twice as fast as ghc 6.8.1. On the one hand it was disappointing to see that the Haskell register allocator seems a bit inept and was loading data into registers that should never have been spilled out of registers in the first place.
Someone who knows the backend better than I do can correct me if I'm wrong, but it's my understanding that GHC 6.8.1 doesn't even attempt to do any register allocation on x86. So -- register allocator? What register allocator? But this is being actively worked on; I understand there's code for it in the HEAD.
On the other hand, the code wasn't fundamentally doing anything weird (eg. it wasn't doing graph reductions or anything like that, it looked like the sort of loop you might get from a C compiler). It was a few seconds of fairly mindless work to fix up the assembly language and make it much faster. And if I can do it, it means that the next generation of Haskell compiler should be able to do it too, after all, good freely available methods to allocate registers do exist. So I'm looking forward to the next version of GHC matching C's performance for inner loops of array manipulation code :-)
It sounds like Team GHC is thinking about the exact same things you are here: http://hackage.haskell.org/trac/ghc/wiki/Status/Nov07 Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Science fiction is not predictive; it is descriptive." --Ursula K. Le Guin

Tim Chevalier wrote:
It sounds like Team GHC is thinking about the exact same things you are here: http://hackage.haskell.org/trac/ghc/wiki/Status/Nov07
Thanks for posting that. I was unaware of that link, and it was very interesting reading. Jules

Jules Bean wrote:
Tim Chevalier wrote:
It sounds like Team GHC is thinking about the exact same things you are here: http://hackage.haskell.org/trac/ghc/wiki/Status/Nov07
Thanks for posting that. I was unaware of that link, and it was very interesting reading.
+1. Last time I checked, the last status update was from quite a long time ago...

Hello Andrew, Saturday, December 15, 2007, 1:17:56 PM, you wrote:
http://hackage.haskell.org/trac/ghc/wiki/Status/Nov07 Thanks for posting that. I was unaware of that link, and it was very interesting reading. +1.
Obviously it's made for the forthcoming HCAR but wasn't announced separately. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Tim Chevalier wrote:
On 12/14/07, Dan Piponi
There have been some great improvements in array handling recently. I decided to have a look at the assembly language generated by some simple array manipulation code and understand why C is at least twice as fast as ghc 6.8.1. On the one hand it was disappointing to see that the Haskell register allocator seems a bit inept and was loading data into registers that should never have been spilled out of registers in the first place.
Someone who knows the backend better than I do can correct me if I'm wrong, but it's my understanding that GHC 6.8.1 doesn't even attempt to do any register allocation on x86. So -- register allocator? What register allocator?
That's not entirely true - there is a fairly decent linear-scan register allocator in GHC

http://darcs.haskell.org/ghc/compiler/nativeGen/RegAllocLinear.hs

the main bottleneck is not the quality of the register allocation (at least, not yet).

The first problem is that in order to get good performance when compiling via C we've had to lock various global variables into registers (the heap pointer, stack pointer etc.), which leaves too few registers free for argument passing on x86, so the stack is used too much. This is probably why people often say that the register allocator sucks - in fact it is really the calling convention that sucks. There is some other stupidness such as reloading values from the stack, though.

Another problem is that the backend doesn't turn recursion into loops (i.e. backward jumps), so those crappy calling conventions are used around every loop. If we fixed that - which is pretty easy, we've tried it - then the bottleneck becomes the lack of loop optimisations in the native code generator, and we also run into the limitations of the current register allocator. Fortunately the latter has been fixed: Ben Lippmeier has written a graph-colouring allocator, and it's available for trying out in GHC HEAD.

Fixing it all properly means some fairly significant architectural changes, and dropping the via-C backend (except for bootstrapping on new platforms), which is what we'll be doing in 6.10. I'd expect to see some dramatic improvements for those tight loops, in ByteString for example, but for "typical" Haskell code and GHC itself I'll be pleased if we get 10%. We'll see.

Cheers, Simon

On 12/20/07, Simon Marlow
That's not entirely true - there is a fairly decent linear-scan register allocator in GHC
http://darcs.haskell.org/ghc/compiler/nativeGen/RegAllocLinear.hs
the main bottleneck is not the quality of the register allocation (at least, not yet).
The first problem is that in order to get good performance when compiling via C we've had to lock various global variables into registers (the heap pointer, stack pointer etc.), which leaves too few registers free for argument passing on x86, so the stack is used too much. This is probably why people often say that the register allocator sucks - in fact it is really the calling convention that sucks. There is some other stupidness such as reloading values from the stack, though.
[snipped further reasons] Thanks for enlightening me. (I had been opting to believe the various rumor and hearsay floating around rather than actually reading the source :-) One reason why I care about this is that over the summer I was trying to do some performance measurements for House. One of the experiments I did was measuring how long it took to run a loop of Haskell code that just did a no-op FFI call. This was still ten times slower than a loop in C that called the same no-op function. I looked at the generated code (with the native-code backend), noticed the issues you mentioned above (reloading values from the stack, and so on), and concluded that there was probably a good reason why the backend was being worked on actively. The -fvia-C code wasn't much better. However, this was with GHC 6.2, so obviously this suggests that porting House to a newer GHC version might be worthwhile for us to do :-) Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Dare to be naive."--R. Buckminster Fuller
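A sketch of the kind of benchmark described above (hypothetical, not the House code), assuming a C function void noop(void) is compiled and linked alongside:

    {-# LANGUAGE ForeignFunctionInterface #-}
    module Main where

    -- Import the no-op; 'unsafe' keeps the per-call overhead minimal.
    foreign import ccall unsafe "noop" c_noop :: IO ()

    -- Call it N times; the interesting quantity is the per-call
    -- overhead compared to the same loop written in C.
    loop :: Int -> IO ()
    loop 0 = return ()
    loop n = c_noop >> loop (n - 1)

    main :: IO ()
    main = loop 10000000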

Hello Andrew, Friday, December 14, 2007, 11:12:18 PM, you wrote:
So it seems even with ByteStrings and stream fusion, we're still 10x slower than naive C. :-( You can console yourself that maybe 5
In the ByteString paper and in the Parallel Arrays paper you can find that the difference between Haskell and C code may be hundreds of times, so this is definitely an improvement. All the talk about Haskell being as fast as C is just propaganda :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 12/14/07, Andrew Coppin
Well anyway, a few years ago we didn't have fusion,
This is not true. Shortcut deforestation (fusion) has been in GHC for at least fourteen years: http://citeseer.ist.psu.edu/gill93short.html It's true that we didn't have stream fusion a few years ago.
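For reference, shortcut deforestation boils down to essentially one rewrite rule. The sketch below states it roughly as it appears in GHC's base library (the module wrapper and rule name here are illustrative):

    {-# LANGUAGE RankNTypes #-}
    module Deforest where

    import GHC.Exts (build)

    -- The foldr/build rule: a foldr consuming a list constructed by
    -- 'build' never materialises the intermediate list at all.
    {-# RULES
    "foldr/build" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
      foldr k z (build g) = g k z
      #-}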
and we didn't have ByteString.
(true.) [snip]
All I know is that once again, it seems Haskell isn't as fast as I thought... *sigh*
No, what you know is that *one particular implementation* of Haskell isn't as fast as you thought. [snip]
A few years ago, the program would have taken 0.35 seconds. The end.
I would be careful about making this statement if I were you. Did you actually try running the same code in GHC 5.0? It's not as if GHC didn't implement *any* optimizations in 2003 (you said above that "0.35 seconds" was the result you got with -O0.) Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Now, don't bother to flame me, because by the time you read this post I will be a different person from the one who wrote it. You cannot step in the same river twice." -- Sarah Barton

Andrew Coppin wrote:
- Program with no particular optimisations: 0.35 seconds.
- Program with stream fusion [and GHC HEAD]: 0.25 seconds.
- Program with stream fusion and ByteString: 0.05 seconds.
Surely you'd have to work pretty hard to get that kind of speed even in C. ;-)
...erm, actually no. Somebody sat down and wrote something in five minutes that takes 0.005 seconds. Oops!

You may also be paying a fixed cost penalty in GHC run-time initialization. Try increasing N and see what happens.
Paul.

Paul Johnson wrote:
Andrew Coppin wrote:
- Program with no particular optimisations: 0.35 seconds.
- Program with stream fusion [and GHC HEAD]: 0.25 seconds.
- Program with stream fusion and ByteString: 0.05 seconds.
Surely you'd have to work pretty hard to get that kind of speed even in C. ;-)
...erm, actually no. Somebody sat down and wrote something in five minutes that takes 0.005 seconds. Oops!

You may also be paying a fixed cost penalty in GHC run-time initialization. Try increasing N and see what happens.
Yeah. Hence the "we should use something that takes tens of seconds". ;-) (I suppose I could try writing a nop program and timing it. But personally I don't have any way of timing things to that degree of accuracy. I understand there are command line tools on Unix that will do it, but not here.)
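One portable workaround (a sketch, not something proposed in the thread): have the program time itself using System.CPUTime, which works on Windows as well. The 'work' computation below is a placeholder:

    import System.CPUTime (getCPUTime)

    -- Placeholder for whatever computation is being benchmarked.
    work :: Int
    work = sum [1 .. 10000000]

    -- getCPUTime returns picoseconds of CPU time used so far, so the
    -- difference brackets the cost of the computation in between.
    main :: IO ()
    main = do
      t0 <- getCPUTime
      print work
      t1 <- getCPUTime
      putStrLn ("CPU time: "
                ++ show (fromIntegral (t1 - t0) / 1e12 :: Double)
                ++ " seconds")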

On Dec 15, 2007 10:15 AM, Andrew Coppin
(I suppose I could try writing a nop program and timing it. But personally I don't have any way of timing things to that degree of accuracy. I understand there are command line tools on Unix that will do it, but not here.)
Like [1]? [1] http://shootout.alioth.debian.org/debian/benchmark.php?test=hello&lang=all -- Felipe.

Felipe Lessa wrote:
On Dec 15, 2007 10:15 AM, Andrew Coppin
I suppose I could try writing a nop program and timing it.
Like [1]?
[1] http://shootout.alioth.debian.org/debian/benchmark.php?test=hello&lang=all
Right. Like that. So now we have Don's timings for the Haskell program on his PC, Scott's timings for the C program on his PC, and the shootout's timings for a no-op Haskell program on their PC. I think what I need to do here is run a set of timings all on the same PC! ;-) I'll probably use my laptop. It has SuSE Linux, so it should also have a C compiler. (Or if it doesn't, I can easily get one. But I'm pretty sure GHC pulls it in as a dependency.) While I'm here, can somebody tell me the correct command to get a time and peak RAM usage printout for a program?

On Dec 15, 2007 11:21 AM, Andrew Coppin
While I'm here, can somebody tell me the correct command to get a time and peak RAM usage printout for a program?
For how the shootout guys did it, see their FAQ [1] and the code of the benchmarker [2]. It seems that [3] is the file responsible for executing the tests, but it's pretty huge and my eyes don't really like reading Perl =).

[1] http://shootout.alioth.debian.org/debian/faq.php
[2] http://alioth.debian.org/scm/?group_id=30402
[3] http://alioth.debian.org/plugins/scmcvs/cvsweb.php/shootout/bin/minibench?cv...

Cheers, -- Felipe.

Andrew Coppin wrote:
(I suppose I could try writing a nop program and timing it. But personally I don't have any way of timing things to that degree of accuracy. I understand there are command line tools on Unix that will do it, but not here.)
You can try for example this one http://www.pc-tools.net/win32/ptime/ to measure times better on Windows. I tried the above program a few years ago and it seemed to work. If you do not mind installing Cygwin then you can get the time command from it.

The only problem is that both ptime and Cygwin time do not add the times of child processes to the result. Unix tools do that by default (since the child's accounting info is added to the parent process if the child is waited for). If you want to add children's time to your result, you probably need to write your own utility for Win32 timing. It should be something like 100 lines of C code. See the QueryInformationJobObject Win32 API function to get started.

I know only one commercial tool which can take children's time into account on Windows. If you write one and decide to release it under a free license, let me know :) Peter.

On 12/15/07, Andrew Coppin
(I suppose I could try writing a nop program and timing it. But personally I don't have any way of timing things to that degree of accuracy. I understand there are command line tools on Unix that will do it, but not here.)
Try the -Rghc-timing flag. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "and there's too much darkness in an endless night to be afraid of the way we feel" -- Bob Franke

Tim Chevalier wrote:
Try the -Rghc-timing flag.
Interesting, that one does not work in my program compiled with GHC 6.8.1 (it looks like the GHC runtime does not consume it but passes it to my Haskell code). +RTS -tstderr works, but its usability is limited since it provides only elapsed time and not the process CPU times. Peter.

On 12/15/07, Peter Hercek
Tim Chevalier wrote:
Try the -Rghc-timing flag.
Interesting, that one does not work in my program compiled with GHC 6.8.1 (it looks like the GHC runtime does not consume it but passes it to my Haskell code). +RTS -tstderr works, but its usability is limited since it provides only elapsed time and not the process CPU times.
Sorry, my mistake -- it's an RTS option, so:

./program +RTS -Rghc-timing -RTS

and I guess you have to compile with -prof. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "Live fast, love hard, and wear corrective lenses if you need them." --Webb Wilder

Tim Chevalier wrote:
On 12/15/07, Peter Hercek wrote:
Tim Chevalier wrote:
Try the -Rghc-timing flag.
Interesting, that one does not work in my program compiled with GHC 6.8.1 (it looks like the GHC runtime does not consume it but passes it to my Haskell code). +RTS -tstderr works, but its usability is limited since it provides only elapsed time and not the process CPU times.
Sorry, my mistake -- it's an RTS option, so:
./program +RTS -Rghc-timing -RTS
and I guess you have to compile with -prof.
I guess it is just buggy in 6.8.1. That option does not seem to work, not even as an RTS option, and even when I compile with "-prof -auto-all". But the user guide states that the result should be the same as with "+RTS -tstderr", and if so then it is not that interesting (since CPU times are missing). Btw, "+RTS -tstderr" works without -prof too, which is nice :) I liked the idea that a GHC-generated exe can report its times too (I meant also CPU times and not only the elapsed time), but external programs work well for this too, so never mind. Thanks, Peter.
participants (13)

- Andrew Coppin
- Bulat Ziganshin
- Dan Piponi
- Don Stewart
- Felipe Lessa
- Jules Bean
- Neil Mitchell
- Paul Johnson
- Peter Hercek
- Robin Green
- Roman Leshchinskiy
- Simon Marlow
- Tim Chevalier