Performance: MD5

Hi folks. OK, try this:

  darcs get http://darcs.orphi.me.uk/MyMD5
  cd MyMD5
  ghc -O2 --make md5sum
  md5sum <some large filename>

On my PC, it takes 3.5 seconds to compute the MD5 hash of a 10 MB file. For comparison, the md5.exe I downloaded from the Internet does it instantaneously. It's literally so fast I can't even time it. As far as I know, my program always produces the *correct* answer, it just takes far too long to do it.

If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them. (Indeed, a description of the process for tracking the problem down would also be useful!)

[Much to my bemusement, when I had a look at the code, I discovered that tucked away in a corner somewhere, it reads a ByteString from disk and instantly converts it to [Word8]. Uh... oops? Anyway, eliminating this changed the numbers I get from the profiler, but it's still far too slow.]

Hello Andrew, Saturday, May 17, 2008, 6:51:27 PM, you wrote:
If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them.
it will be equally interesting to hear why you think your program should be as fast as the C version. you wrote it in purely functional style. as Don wrote, if you want to reach unoptimized C-like performance, you need to write in C style - with explicit loops processing primitive types. so:

1) -funbox-strict-fields
2) don't use tuples as arguments - they are lazy. you may implement strict tuples instead
3) a function passed as a parameter is very bad unless it will be inlined

but these are only half-measures. if you need maximum efficiency, drop all this high-level code and map md5.c to haskell, as it was done in Don's blog -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
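[A minimal sketch of points 1 and 2, assuming a hypothetical MD5State type that is not taken from either implementation: with -funbox-strict-fields (or UNPACK pragmas) the four strict fields are stored unboxed, whereas a plain lazy tuple threaded between rounds keeps each component as a possibly-unevaluated thunk.]

  {-# OPTIONS_GHC -funbox-strict-fields #-}

  import Data.Word (Word32)

  -- Strict fields: with -funbox-strict-fields GHC stores the four
  -- Word32s directly in the constructor instead of behind pointers.
  data MD5State = MD5State !Word32 !Word32 !Word32 !Word32

  -- A lazy tuple accumulator, by contrast, can build up a chain of
  -- unevaluated thunks in each component on every round.
  type LazyState = (Word32, Word32, Word32, Word32)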

Bulat Ziganshin wrote:
if you need maximum efficiency, drop all this high-level code and map md5.c to haskell as it was done in Don's blog
Wait... you're telling me to give up? To not even try?? So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense? What an awesome "community" this language has behind it... I'm not sure why I bother.

Hello Andrew, Sunday, May 18, 2008, 12:40:20 PM, you wrote:
So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense?
yes. look at the bytestring implementation, for example. sorry, but you shouldn't believe in something until you've tried it yourself :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi
if you need maximum efficiency, drop all this high-level code and map md5.c to haskell as it was done in Don's blog
Wait... you're telling me to give up? To not even try??
He's saying use the resources the community have already provided, to write it in a lower-level style, following much the same pattern as Don did.
So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense?
Nope. If you just code up a simple algorithm in C and in Haskell, it is pretty unlikely the Haskell one will go as fast as the C one. But Haskell (in particular GHC) does have powerful low-level features that mean you can probably draw with C. One day, I hope that high-level Haskell will outperform C all the time. It's already true with some of the ByteString stuff, but hopefully one day it will be a normal result. It's pretty much the goal of this optimiser: http://www-users.cs.york.ac.uk/~ndm/supero/ Thanks Neil

Hello Neil, Sunday, May 18, 2008, 12:54:30 PM, you wrote:
One day, I hope that high-level Haskell will outperform C all the time. It's already true with some of the ByteString stuff, but
i don't think so. all the tests where Haskell "outperforms" C that i've seen were just unfair - for example, they compare results of MANUALLY-unrolled Haskell loops with rolled C ones, totally ignoring the fact that gcc can unroll loops just by enabling an option, while with ghc this can be done only by hand. many tests that show "equal performance" are just limited by memory bandwidth. and so on. it's hard to write a program that outperforms gcc-generated code even using assembler, and ghc is limited by this fact, too :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

andrewcoppin:
Bulat Ziganshin wrote:
if you need maximum efficiency, drop all this high-level code and map md5.c to haskell as it was done in Don's blog
Wait... you're telling me to give up? To not even try??
So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense?
What optimisations were operating in your md5 implementation that you could expect it to match the C version?
What an awesome "community" this language has behind it... I'm not sure why I bother.
Andrew, take the time to understand what your code actually does when you write it. If you want to write code that matches some heavily optimised md5 in C (did you look at the C implementation?) then your Haskell will need to compile down to similar data and control structures. Does it? So, as Bulat says, explain why your code should have similar performance to the C version. It would be much more useful than complaining about the community who patiently help you day in and day out. -- Don

Don Stewart wrote:
andrewcoppin:
Wait... you're telling me to give up? To not even try??
So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense?
What optimisations were operating in your md5 implementation that you could expect it to match the C version?
I was hoping that things would be inlined, small loops unrolled, and I'd end up with more or less the same imperative loops as the C version has. I wasn't expecting to *beat* C in this particular case, but I expected to get somewhere fairly near it.
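[As an aside on inlining: GHC will not unroll loops for you, but small helper functions can be inlined explicitly with a pragma. A minimal sketch, not taken from Andrew's code; md5F happens to be MD5's F(x,y,z) boolean function, and the pragma placement is the point:]

  import Data.Bits (complement, (.&.), (.|.))
  import Data.Word (Word32)

  -- F(x,y,z) = (x AND y) OR (NOT x AND z), per RFC 1321.  The INLINE
  -- pragma asks GHC to inline it at every call site, so each use in a
  -- round becomes straight-line code rather than a function call.
  {-# INLINE md5F #-}
  md5F :: Word32 -> Word32 -> Word32 -> Word32
  md5F x y z = (x .&. y) .|. (complement x .&. z)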
If you want to write code that matches some heavily optimised md5 in C then your Haskell will need to compile down to similar data and control structures.
Agreed.
(did you look at the C implementation?)
I can't read C. (FWIW, I think I did briefly stare at the sources, but eventually gave up because I simply had no clue what's going on.)
Does it?
I'm still figuring out how to answer that question. [As in, "the profiler says 60% of my time is spent inside one function, I've now tried writing that function 6 different ways, and currently it compiles to about 3 pages of Core that I'm still attempting to decipher". Given the size of the Core for this function, I'd say it's probably inlining and unrolling just fine...]
So, as Bulat says, explain why your code should have similar performance to the C version.
Because it executes the same algorithm? I mean, there's essentially only one way round that you can perform the MD5 algorithm, so it just comes down to how efficiently you execute the steps involved.
It would be much more useful than complaining about the community who patiently help you day in and out.
Telling somebody "don't bother trying, it's impossible" doesn't strike me as terribly helpful. It doesn't make my code go any faster, and it certainly doesn't help me learn anything. I guess my response was a little harsh. But when you invest a lot of time and effort in something, and somebody off-handedly tells you you're wasting your time... it's quite upsetting.

On Sun, 2008-05-18 at 11:20 +0100, Andrew Coppin wrote:
So, as Bulat says, explain why your code should have similar performance to the C version.
Because it executes the same algorithm? I mean, there's essentially only one way round that you can perform the MD5 algorithm, so it just comes down to how efficiently you execute the steps involved.
Ah if only computation were that simple. One might say that the ultimate perfect compiler would have this property. However it's provably impossible. See for example the Full Employment Theorem for Compiler Writers[1]. Duncan [1] http://en.wikipedia.org/wiki/Full_employment_theorem

On 2008-05-18, Andrew Coppin
(did you look at the C implementation?)
I can't read C. (FWIW, I think I did briefly stare at the sources, but eventually gave up because I simply had no clue what's going on.)
It's worth learning. It's still the only widely used abstract portable assembler with fairly standard ABIs for each platform. Go read K&R[1]. It shouldn't take more than a week's worth of spare time. [1] The C Programming Language (2nd Edition), Kernighan and Ritchie, Prentice Hall, 1988 http://www.amazon.com/Programming-Language-Prentice-Hall-Software/dp/0131103... -- Aaron Denney -><-

Aaron Denney
Go read K&R[1]. It shouldn't take more than a week's worth of spare time.
HELL NO! There's a reason why my lecturer always referred to it as "Knall & Rauch C" (Bang and Smoke C). Get the Harbison & Steele instead: http://careferencemanual.com/ -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

On 2008-05-18, Achim Schneider
Aaron Denney
wrote: Go read K&R[1]. It shouldn't take more than a week's worth of spare time.
HELL NO!
There's a reason why my lecturer always referred to it as "Knall & Rauch C" (Bang and Smoke C).
Get the Harbison & Steele instead: http://careferencemanual.com/
C is a little language. It doesn't /need/ a 500 page tome. K&R is not something to learn programming from, and of course it does quite poorly when used as a textbook for that purpose. It is, instead, for programmers to learn C. -- Aaron Denney -><-

On 2008 May 18, at 4:40, Andrew Coppin wrote:
Bulat Ziganshin wrote:
if you need maximum efficiency, drop all this high-level code and map md5.c to haskell as it was done in Don's blog
Wait... you're telling me to give up? To not even try??
Bulat is the naysayer of the group; according to him Haskell is a nice idea that will never actually work (but he uses it anyway, who knows how he rationalizes it to himself). -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
Bulat is the naysayer of the group; according to him Haskell is a nice idea that will never actually work (but he uses it anyway, who knows how he rationalizes it to himself).
Bulat is apparently not alone in this belief. I seem to spend all my time on other forums dealing with people who have exactly the same opinion. Of course, being able to come up with clean, elegant code which nevertheless performs well is a good way to counter this. Pity I haven't succeeded in producing any yet. :-S

On 2008 May 18, at 9:55, Andrew Coppin wrote:
Brandon S. Allbery KF8NH wrote:
Bulat is the naysayer of the group; according to him Haskell is a nice idea that will never actually work (but he uses it anyway, who knows how he rationalizes it to himself).
Bulat is apparently not alone in this belief. I seem to spend all my time on other forums dealing with people who have exactly the same opinion.
There's a difference. When I started (GHC 6.4.2 was current), Bulat was spending his mailing list time denying the possibility of what DPH does now, and claiming that what GHC 6.8.2 does now was unlikely. (Meanwhile, I can't say much for "oh, i didn't understand that code, but surely we should be able to do at least as good without performance hacks?" when the code you didn't understand is all performance hacks.... You really have no business trying to draw comparisons when you don't even know when you're comparing apples to aardvarks, let alone apples to oranges.) Optimization is hard. Don't like it? Become a compiler researcher and find better ways to do it. Note that C / procedural language optimization research has at least a 20-year head start on functional programming optimization, and that the problems are very different: the C world would love to be at the point where optimizing the C equivalent of "sum xs / length xs" is worth thinking about; they're still not really in a position to *detect* it unless the language is simple enough to make such reasoning relatively easy (e.g. FORTRAN). -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
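[For readers wondering about the "sum xs / length xs" remark: the optimisation in question is fusing two list traversals into one strict pass. A hedged sketch of the hand-written single pass an optimiser would have to discover - none of this code is from the thread:]

  {-# LANGUAGE BangPatterns #-}

  import Data.List (foldl')

  -- One strict pass accumulating the running total and the element
  -- count together, instead of traversing the list twice.
  mean :: [Double] -> Double
  mean xs = total / fromIntegral count
    where
      (total, count)  = foldl' step (0, 0 :: Int) xs
      step (!s, !n) x = (s + x, n + 1)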

Hello Brandon, Sunday, May 18, 2008, 6:25:31 PM, you wrote:
There's a difference. When I started (GHC 6.4.2 was current), Bulat was spending his mailing list time denying the possibility of what DPH does now, and claiming that what GHC 6.8.2 does now was unlikely.
what do DPH and 6.8.2 do? i said that ghc can't reach the same level of optimization as gcc, and it's still true - for example, it can't unroll loops. ghc 6.6 generates very simple code which holds all the vars in memory; 6.8 can hold them in registers. gcc implements much more sophisticated optimizations - just try to implement something larger than one-liners in imperative haskell and in C and compare the results. it seems that you have never developed highly-optimized programs in C, so all your reasoning is just a product of haskell-fanatic imagination. show me your code, or shut up, please -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Brandon, Sunday, May 18, 2008, 6:25:31 PM, you wrote:
Optimization is hard. Don't like it? Become a compiler researcher and find better ways to do it. Note that C / procedural language optimization research has at least a 20-year head start on functional programming optimization, and that the problems are very different: the C world would love to be at the point where optimizing the C equivalent of "sum xs / length xs" is worth thinking about; they're still not really in a position to *detect* it unless the language is simple enough to make such reasoning relatively easy (e.g. FORTRAN).
and a few lines later you repeat what i said a few years ago - ghc optimization research has got 100x less attention than C++ optimization, so we can't expect the same results in the next 5-10 years at least. strange that you still believe that ghc can optimize as well as gcc -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 2008 May 18, at 10:45, Bulat Ziganshin wrote:
and a few lines later you repeat what i said a few years ago - ghc optimization research has got 100x less attention than C++ optimization, so we can't expect the same results in the next 5-10
Mmm, no. You tend to say, or at least very strongly imply, "never" --- *not* "5-10 years". (Perhaps this is why you are rarely listened to. Also, perhaps you need to consider how you present things, if that's not actually what you mean.) -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Hello Brandon, Sunday, May 18, 2008, 6:58:34 PM, you wrote:
and a few lines later you repeat what i said a few years ago - ghc optimization research has got 100x less attention than C++ optimization, so we can't expect the same results in the next 5-10
Mmm, no. You tend to say, or at least very strongly imply, "never" --- *not* "5-10 years". (Perhaps this is why you are rarely listened to. Also, perhaps you need to consider how you present things, if that's not actually what you mean.)
we don't have a ghc that optimizes as well as gcc. when i talk about it, you tell me that i'm not right. then you tell me that ghc may be improved in some future - you don't know exactly when. and now you have started to watch every word i say. well, i can do the same - why are you sure that C optimization started 20 years before FP optimization? why are you sure that DPH does what i said would never be possible? please answer each question in detail before trying to critique me. and don't forget that we are still waiting for your great strategy for reaching C-like performance using all those fancy DPH/6.8.2 features. or can you only talk, and talk, and talk, but never actually optimize anything? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 2008 May 18, at 11:10, Bulat Ziganshin wrote:
we don't have a ghc that optimizes as well as gcc. when i talk about it, you tell me that i'm not right. then you tell me that ghc may be improved in
You're doing a remarkably good job of not hearing what I say. Hard numbers etc. do nothing whatsoever in the face of "my presentation style is relentless negativity", because I have every expectation based on past experience that you will find something else to be negative about. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Hello Brandon, Sunday, May 18, 2008, 7:15:17 PM, you wrote:
we don't have a ghc that optimizes as well as gcc. when i talk about it, you tell me that i'm not right. then you tell me that ghc may be improved in
You're doing a remarkably good job of not hearing what I say. Hard
maybe. so, when i proposed using low-level optimization, you contradicted me. show us a better way to optimize this particular program if you know one, or stop bothering me -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 2008 May 18, at 11:21, Bulat Ziganshin wrote:
Hello Brandon,
Sunday, May 18, 2008, 7:15:17 PM, you wrote:
we don't have a ghc that optimizes as well as gcc. when i talk about it, you tell me that i'm not right. then you tell me that ghc may be improved in
You're doing a remarkably good job of not hearing what I say. Hard
maybe. so, when i proposed using low-level optimization, you contradicted me. show us a better way to optimize this particular program if you know one, or stop bothering me
Elide the part that says that it's not worth it and repeat yourself yet again? Bah. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Hello Brandon, Sunday, May 18, 2008, 7:24:12 PM, you wrote:
You're doing a remarkably good job of not hearing what I say. Hard
Elide the part that says that it's not worth it and repeat yourself yet again? Bah.
Brandon, you've contradicted my USEFUL suggestion with a ton of USELESS sheet and still believe that you are right. well, you are a fanatic who can't make anything useful but prefers to believe that his great thoughts were not heard. you can continue to believe; i will continue to help haskellers. RIP -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

As a side note on this discussion, isn't it so that the current CPU hardware evolved in such a way that it is better suited for imperative programs? (I'm not talking about the GPU, that's another story). I mean, hardware gets optimized to run applications faster, but these applications are mainly written in C/C++, so you get a nice feedback loop here... Is it possible to design hardware that is better suited for functional languages?

Hello Peter, Sunday, May 18, 2008, 7:27:35 PM, you wrote:
As a side note on this discussion, isn't it so that the current CPU hardware evolved in such a way that it is better suited for imperative programs? (I'm not talking about the GPU, that's another story). I mean, hardware gets optimized to run applications faster, but these applications are mainly written in C/C++, so you get a nice feedback loop here...
i don't think it's a fundamental limit. hardware is better suited to assembler, naturally, but C++ compilers overruled asm. haskell-like languages can in theory be compiled to code which is as good as asm; it's just not the current state of the art. nevertheless, it's true that new hardware arises, and it's a two-way road: FP languages are used more due to multi-core hardware, and new hardware becomes practically useful due to new ways of programming -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Peter Verswyvelen
Is it possible to design hardware that is better suited for functional languages?
As I recall, Lisp machines went out of business when Lisp ran faster on industry standard, 68000-based Suns and Apollos, than on their custom hardware with tags and whatnot. Anyway - with more and more aggressive caching, and ever increasing number of cores, it looks like C is an even poorer fit for the hardware than it used to be - and the Vax model has been a gross approximation for some time now. Threading and immutable data is getting popular even in the bare-metal imperative camp. Perhaps it is time to break out of the loop? -k -- If I haven't seen further, it is by standing in the footprints of giants

Hello Ketil, Sunday, May 18, 2008, 7:44:38 PM, you wrote:
Anyway - with more and more aggressive caching, and ever increasing number of cores, it looks like C is an even poorer fit for the hardware than it used to be
with the number of cores doubling every 1.5 years, we should see 1000-core monsters within the next decade. so i believe that next decade we will see the next big language shift, and it will be a shift toward FP languages. but Erlang still has better chances - Haskell is too complex, too slow, and not advertised as a multithreading language as widely as Erlang was. just now, MS pushes F#. waiting for Sun :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Peter Verswyvelen wrote:
As a side note on this discussion, isn't it so that the current CPU hardware evolved in such a way that it is better suited for imperative programs? (I'm not talking about the GPU, that's another story). I mean, hardware gets optimized to run applications faster, but these applications are mainly written in C/C++, so you get a nice feedback loop here...
Is it possible to design hardware that is better suited for functional languages?
I wondered about that myself a few times.

For example, current computer designs revolve around a single complex, fast CPU connected to some slow RAM. [Or, more recently, a very small number of separate CPUs.] I wonder what would happen if you instead had a vast number of very simple proto-processors connected in a vast network. [But I'm guessing the first thing that'll happen is that the data is never where you want it to be...]

At any rate, custom hardware vs IA32 is rather like GHC vs GCC - one has been around for a hell of a lot longer, and had vastly more R&D thrown at it. Unless you can find some crucial insight that gives you a very big advantage, you're not going to get anywhere near the market leader with anything you can come up with by yourself.

(It does seem that the major trouble with modern processors is that they only go fast if there are no cache misses. Back in the days when RAM was as fast as the CPU, you could structure your program any way you like and it would just work. Today you have to jump through hoops to make sure the cache behaviour is good, or you can just forget it and go home now. That tends to be a pretty big problem for any programming language with automatic memory management - Java immediately springs to mind, but of course Haskell is in the same boat. I get the vague impression that hardware designers are starting to take notice of the problem, but it's difficult to see how they could design anything differently. We'll see, I guess... Certainly if hardware appears that handles OOP languages better, it's likely to handle Haskell better too.)

Andrew Coppin
I wonder what would happen if you instead had a vast number of very simple proto-processors connected in a vast network. [But I'm guessing the first thing that'll happen is that the data is never where you want it to be...]
You're not thinking of neuronal networks, are you? The interesting thing there is that they unite code and data. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

Achim Schneider wrote:
Andrew Coppin
wrote: I wonder what would happen if you instead had a vast number of very simple proto-processors connected in a vast network. [But I'm guessing the first thing that'll happen is that the data is never where you want it to be...]
You're not thinking of neuronal networks, are you? The interesting thing there is that they unite code and data.
Damn; you've seen through my cunning disguise. ;-) In all seriousness, it's easy enough to build an artificial neural network that computes a continuous function of several continuous inputs. But it's much harder to see how, for example, you'd build a text parser or something. It's really not clear how you implement flow control with this kind of thing. It's so different to a Turing machine that it appears to render most of current computer science irrelevant. And that's *a lot* of work to redo. Now, if you had a network of something a bit more complicated than artificial neurons, but less complicated than an actual CPU... you'd have... I don't know, maybe something useful? It's hard to say.

Andrew Coppin
Achim Schneider wrote:
Andrew Coppin
wrote: I wonder what would happen if you instead had a vast number of very simple proto-processors connected in a vast network. [But I'm guessing the first thing that'll happen is that the data is never where you want it to be...]
You're not thinking of neuronal networks, are you? The interesting thing there is that they unite code and data.
Damn; you've seen through my cunning disguise. ;-)
In all seriousness, it's easy enough to build an artificial neural network that computes a continuous function of several continuous inputs. But it's much harder to see how, for example, you'd build a text parser or something. It's really not clear how you implement flow control with this kind of thing. It's so different to a Turing machine that it appears to render most of current computer science irrelevant. And that's *a lot* of work to redo.
Hmmm... fuzzy logic, plus a lot of serialisation of parallel (that is, right now saved in linear ram) data. Don't make me think about it, or I'll be forced to confuse you with ramblings about how the brain works. (Is that control flow? ;)
Now, if you had a network of something a bit more complicated than artificial neurons, but less complicated than an actual CPU... you'd have... I don't know, maybe something useful? It's hard to say.
You'd have something like a Cell processor, if you go for (more or less) normal control flow. Maybe we will soon see dedicated pointer RAMs, because hardware manufacturers despair while trying to design a cache manager for 1024 cores: that would make it easy to spot which core holds which pointer, and thus also easy to move the data to it. What would a Haskell compiler look like that targets an FPGA? That is, compiling down to configware, not to an RTS built on top of it. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

Hi
What would a Haskell compiler look like that targets an FPGA? That is, compiling down to configware, not to an RTS built on top of it.
http://www-users.cs.york.ac.uk/~mfn/reduceron2/ Thanks Neil

"Neil Mitchell"
What would a Haskell compiler look like that targets an FPGA? That is, compiling down to configware, not to an RTS built on top of it.
I'm only on page 5 and already drooling.

  fact n = 1 (n (==)) 1 (fact (1 (n (-))) (n (*)))

looks somewhat suspiciously like Forth to me. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

Achim Schneider wrote:
"Neil Mitchell"
wrote: What would a Haskell compiler look like that targets an FPGA? That is, compiling down to configware, not to an RTS built on top of it.
I'm only on page 5 and already drooling.
Damnit, this paper is making me hungry! :-( I almost want to run out into the street and purchase some FPGA hardware right now! [I mean, I wanted to before, but this makes me want to even more.] But hey, let's not get crazy here. (Yet.)

On Sun, May 18, 2008 at 9:09 PM, Andrew Coppin
Achim Schneider wrote:
"Neil Mitchell"
wrote: What would a Haskell compiler look like that targets an FPGA? That is,
compiling down to configware, not to an RTS built on top of it.
http://www-users.cs.york.ac.uk/~mfn/reduceron2/
I'm only on page 5 and already drooling.
Damnit, this paper is making me hungry! :-(
I almost want to run out into the street and purchase some FPGA hardware right now! [I mean, I wanted to before, but this makes me want to even more.] But hey, let's not get crazy here. (Yet.)
Video presentation: http://video.google.com/videoplay?docid=-1518197558546337776 -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862

Yes, some things might be good for both language paradigms, e.g.

* hardware garbage collection
* hardware transactional memory

With a bit of Googling, both seem to exist, the latter even being supported by Sun.

Andrew Coppin wrote:
Peter Verswyvelen wrote:
As a side note on this discussion, isn't it so that the current CPU hardware evolved in such a way that it is better suited for imperative programs? (I'm not talking about the GPU, that's another story). I mean, hardware gets optimized to run applications faster, but these applications are mainly written in C/C++, so you get a nice feedback loop here...
Is it possible to design hardware that is better suited for functional languages?
I wondered about that myself a few times.
For example, current computer designs revolve around a single complex, fast CPU connected to some slow RAM. [Or, more recently, a very small number of separate CPUs.] I wonder what would happen if you instead had a vast number of very simple proto-processors connected in a vast network. [But I'm guessing the first thing that'll happen is that the data is never where you want it to be...]
At any rate, custom hardware vs IA32 is rather like GHC vs GCC - one has been around for a hell of a lot longer, and had vastly more R&D thrown at it. Unless you can find some crucial insight that gives you a very big advantage, you're not going to get anywhere near the market leader with anything you can come up with by yourself.
(It does seem that the major trouble with modern processors is that they only go fast if there are no cache misses. Back in the days when RAM was as fast as the CPU, you could structure your program any way you like and it would just work. Today you have to jump through hoops to make sure the cache behaviour is good, or you can just forget it and go home now. That tends to be a pretty big problem for any programming language with automatic memory management - Java immediately springs to mind, but of course Haskell is in the same boat. I get the vague impression that hardware designers are starting to take notice of the problem, but it's difficult to see how they could design anything differently. We'll see, I guess... Certainly if hardware appears that handles OOP languages better, it's likely to handle Haskell better too.)

Hello Brandon, Sunday, May 18, 2008, 5:48:48 PM, you wrote:
Bulat Ziganshin wrote:
if you need maximum efficiency, drop all this high-level code and map md5.c to haskell as it was done in Don's blog
Bulat is the naysayer of the group; according to him Haskell is a nice idea that will never actually work (but he uses it anyway, who knows how he rationalizes it to himself).
it significantly depends on the group, actually :) a few weeks ago i tried to convince the managers of a large software company to use haskell for developing complex algorithms. but they don't believe that using a new language instead of C++ can raise productivity several times, and told me that for a real programmer it shouldn't matter which language is used. OTOH, when i say here that haskell produces slow code, you can't believe me and say that haskell is so great that it can't have any shortcomings. well, i use Haskell for complex algorithms and C++ for the performance-critical parts of my program, which resulted in the development of the world's fastest archiver. if you know how to write high-level haskell programs that are as fast as C++ code - please tell us right now -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin wrote:
So all that bravado about Haskell enabling higher-level optimisations to produce a result faster than C is actually complete nonsense?
Nonsense? No. Something that actually exists? No. Haskell enables optimisations. Writing a SufficientlySmartCompiler ( http://c2.com/cgi/wiki?SufficientlySmartCompiler ) still requires work. Jules

Andrew,

I spent a reasonable amount of time making pureMD5 (available on hackage) faster, which mainly meant strictness annotations and unboxing strict fields, but I also spent a good deal of time with the profiler. One of my early versions was fairly slow because the conversion of the LPS to blocks (an Isabelle-proven 'blocks' function) caused an O(n) space requirement. I've been meaning to revisit this to optimize more and look closely at GHC core. I'm on vacation now, but when I get back I'll try to make time to look at your code (unless it's moot by then).

Finally, I encourage anyone interested to write reasonably fast bytestring implementations of the SHA algorithms as well - Haskell/Hackage has a number of pure and FFI implementations of MD5 but is a bit lacking in any other cryptographic algorithm.

TomMD

On Sat, 2008-05-17 at 15:51 +0100, Andrew Coppin wrote:
Hi folks.
OK, try this:
darcs get http://darcs.orphi.me.uk/MyMD5
cd MyMD5
ghc -O2 --make md5sum
md5sum <some large filename>
On my PC, it takes 3.5 seconds to compute the MD5 hash of a 10 MB file. For comparison, the md5.exe I downloaded from the Internet does it instantaneously. It's literally so fast I can't even time it. As far as I know, my program always produces the *correct* answer, it just takes far too long to do it.
If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them. (Indeed, a description of the process for tracking the problem down would also be useful!)
[Much to my bemusement, when I had a look at the code, I discovered that tucked away in a corner somewhere, it reads a ByteString from disk and instantly converts it to [Word8]. Uh... oops? Anyway, eliminating this changed the numbers I get from the profiler, but it's still far too slow.]
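[A hedged sketch of driving the pureMD5 package mentioned above, from memory of its API - the module and function names (Data.Digest.Pure.MD5, md5) may differ between versions:]

  import qualified Data.ByteString.Lazy as L
  import Data.Digest.Pure.MD5 (md5)
  import System.Environment (getArgs)

  -- Hash a file via the lazy-ByteString interface; the file is
  -- streamed in chunks rather than read into memory all at once.
  main :: IO ()
  main = do
      [file] <- getArgs
      bytes  <- L.readFile file
      print (md5 bytes)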

Andrew Coppin wrote:
If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them.
OK, so I am now officially talking to myself.

Profiling indicates that 60% of my program's time is spent inside one function:

  make_block_bs :: ByteString -> Block

In other words, the program spends 60% of its running time converting an array of Word8 into an array of Word32. [Note that the MD5 algorithm incorrectly specifies that little-endian ordering should be used, so I do the byte-swapping in here too.] The code for breaking the file into correctly-sized pieces and adding padding is a full 3% of the runtime, so it's not even the copying overhead. It's *all* in the array conversions.

How much do you want to bet that the C version just reads a chunk of data into RAM and then does a simple type-cast? (If I'm not very much mistaken, in C a type-cast is an O(0) operation.)

Anyway, wading through the 3-page Core output for this function, I'm seeing a lot of stuff of the form

  case GHC.Prim.<=# 0 whuhrjhsd of wild3784_hguhse { GHC.Base.False -> (GHC.Err.error @ GHC.Base.Int MD5.Types.lvl `cast` (CoUnsafe ...

I'd be lying if I said I had any real idea what that does. But it occurs to me that the code is probably performing runtime checks that every single array access is within bounds, and that every single bit shift operation is similarly within bounds. This is obviously very *safe*, but possibly not very performant.

Isn't there some function kicking around somewhere for accessing arrays without the bounds check? Similarly, I'm sure I remember the Data.Bits bounds check being mentioned before, and somebody saying they had [or were going to?] make an unsafe variant available. I've looked around the library documentation and I'm not seeing anything... any hints? (Obviously I could attempt to dig through GHC.Prim and find what I'm after there, but... it looks pretty scary in there! I'm hoping for something slightly higher-level if possible.)

The alternative, obviously, is to just perform an evil type cast. Trouble is, that'll work just fine on any architecture with a little-endian data order [conspicuously IA32 and AMD64], and malfunction horribly on any architecture that works correctly. (At which point I start wondering how C manages to overcome this problem...)
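[A hedged sketch of the kind of conversion under discussion - this is not Andrew's make_block_bs, and readWord32LE is a made-up helper. It reads the n-th little-endian Word32 out of a strict ByteString with unchecked indexing, so the caller must guarantee that index 4*n+3 is in range:]

  import qualified Data.ByteString as B
  import qualified Data.ByteString.Unsafe as BU
  import Data.Bits (shiftL, (.|.))
  import Data.Word (Word32)

  -- Little-endian: byte 0 is the least significant byte of the word.
  readWord32LE :: B.ByteString -> Int -> Word32
  readWord32LE bs n =
        fromIntegral (BU.unsafeIndex bs (4*n))
    .|. (fromIntegral (BU.unsafeIndex bs (4*n + 1)) `shiftL` 8)
    .|. (fromIntegral (BU.unsafeIndex bs (4*n + 2)) `shiftL` 16)
    .|. (fromIntegral (BU.unsafeIndex bs (4*n + 3)) `shiftL` 24)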

Hello Andrew, Sunday, May 18, 2008, 3:42:23 PM, you wrote:
How much do you want to bet that the C version just reads a chunk of data into RAM and then does a simple type-cast? (If I'm not very much mistaken, in C a type-cast is an O(0) operation.)
try unsafeCoerce# for that. or just use hGetArray to read directly into an array of the type you need -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Sunday, May 18, 2008, 3:42:23 PM, you wrote:
How much do you want to bet that the C version just reads a chunk of data into RAM and then does a simple type-cast? (If I'm not very much mistaken, in C a type-cast is an O(0) operation.)
try unsafeCoerce# for that.
Yeah, I might end up being forced to do that. [Although obviously it'll break on big-endian architectures.]
or just use hGetArray to read directly into array of type you need
hGetArray :: Handle -> IOUArray Int Word8 -> Int -> IO Int

(But yeah, I can coerce it afterwards.)
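[A hedged sketch of the hGetArray route - readBlock is a made-up name and the subsequent coercion to Word32 is left out. It fills a mutable unboxed Word8 array straight from the Handle, with no intermediate list:]

  import Data.Array.IO (IOUArray, hGetArray, newArray)
  import Data.Word (Word8)
  import System.IO (Handle)

  -- Returns the buffer plus the number of bytes actually read,
  -- which may be less than `count` near the end of the file.
  readBlock :: Handle -> Int -> IO (IOUArray Int Word8, Int)
  readBlock h count = do
      buf <- newArray (0, count - 1) 0
      n   <- hGetArray h buf count
      return (buf, n)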

Hello Andrew, Sunday, May 18, 2008, 3:42:23 PM, you wrote:
Isn't there some function kicking around somewhere for accessing arrays without the bounds check? Similarly, I'm sure I remember the Data.Bits bounds check being mentioned before, and somebody saying they had [or were going to?] make an unsafe variant available. I've looked around the library documentation and I'm not seeing anything... any hints?
unsafeRead/Write

  (I# a) <<# (I# b) = (I# (a `iShiftL#` b))
  (I# a) >># (I# b) = (I# (a `uncheckedIShiftRL#` b))

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Sunday, May 18, 2008, 3:42:23 PM, you wrote:
Isn't there some function kicking around somewhere for accessing arrays without the bounds check? Similarly, I'm sure I remember the Data.Bits bounds check being mentioned before, and somebody saying they had [or were going to?] make an unsafe variant available. I've looked around the library documentation and I'm not seeing anything... any hints?
unsafeRead/Write
Found in Data.Array.Base, apparently. (I managed to find an example in the Gtk2hs "fastdraw" demo.) Oddly, this seems to make virtually no performance difference. I find this highly surprising. But then, I perform 16 writes to the array, and 64 reads from the ByteString, so maybe that's where I should be worrying about bounds checks...
(I# a) <<# (I# b) = (I# (a `iShiftL#` b))
(I# a) >># (I# b) = (I# (a `uncheckedIShiftRL#` b))
Does it make a difference if I want Word32 instead of Int?
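[Following up the unsafeRead/unsafeWrite suggestion, a minimal sketch with a made-up fillBlock helper: unsafeWrite from Data.Array.Base takes a zero-based index and skips the range check, so the caller has to guarantee the index is in bounds.]

  import Data.Array.Base (unsafeWrite)
  import Data.Array.IO (IOUArray, newArray)
  import Data.Word (Word32)

  -- Build a 16-word MD5 block from an indexing function, eliding the
  -- bounds check on each of the 16 writes.
  fillBlock :: (Int -> Word32) -> IO (IOUArray Int Word32)
  fillBlock f = do
      block <- newArray (0, 15) 0
      mapM_ (\i -> unsafeWrite block i (f i)) [0 .. 15]
      return block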

2008/5/18 Andrew Coppin
unsafeRead/Write
Found in Data.Array.Base, apparently. (I managed to find an example in the Gtk2hs "fastdraw" demo.)
Oddly, this seems to make virtually no performance difference. I find this highly surprising. But then, I perform 16 write to the array, and 64 reads from the ByteString, so maybe that's where I should be worrying about bounds checks...
Hi Andrew, just a profiling suggestion: did you try to use the SCC cost-centre annotations for profiling? If you want to know precisely what takes 60% of time, you can try:

  bn = {-# SCC "IntegerConversion" #-} 4 * fromIntegral wn
  b0 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+0)
  b1 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+1)
  b2 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+2)
  b3 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+3)
  w = foldl' (\w b -> shiftL w 8 .|. fromIntegral b) 0 [b3,b2,b1,b0]
  in {-# SCC "ArrayWriting" #-} unsafeWrite temp wn w

In profiling the time of all expressions with the same SCC "name" will be summed. You can get more information about SCC here: http://www.haskell.org/ghc/docs/latest/html/users_guide/profiling.html#cost-...

Obviously, even ultra-optimizing this function to be 60 times faster (assuming that this is possible) will only make your program go about 2 times faster.

One advice: I've seen various md5sum implementations in C, all using about the same algorithms and data structures, and they performed even with 10x time differences between them; md5summing fast is not a really simple problem. If I were you I would drop the comparison with ultra-optimized C and concentrate on "does my high-level-good-looking-super-readable implementation perform acceptably?". What "acceptably" means is left as an exercise to the reader.

Salvatore
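[In case it helps, a cost-centre profile like the one above is produced roughly as follows, using the standard GHC profiling flags described in the linked user's guide:

  ghc -O2 -prof -auto-all --make md5sum
  ./md5sum <some large filename> +RTS -p -RTS

The run writes an md5sum.prof report with time and allocation broken down per cost centre, including any SCCs added by hand.]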

Salvatore Insalaco wrote:
Hi Andrew, just a profiling suggestion: did you try to use the SCC cost-centre annotations for profiling? If you want to know precisely what takes 60% of time, you can try:

  bn = {-# SCC "IntegerConversion" #-} 4 * fromIntegral wn
  b0 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+0)
  b1 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+1)
  b2 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+2)
  b3 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+3)
  w = foldl' (\w b -> shiftL w 8 .|. fromIntegral b) 0 [b3,b2,b1,b0]
  in {-# SCC "ArrayWriting" #-} unsafeWrite temp wn w
In profiling the time of all expressions with the same SCC "name" will be summed. You can get more information about SCC here: http://www.haskell.org/ghc/docs/latest/html/users_guide/profiling.html#cost-...
OK, I'll give that a try...
One advice: I've seen various md5sum implementations in C, all using about the same algorithms and data structures, and they performed even with 10x time differences between them; md5summing fast is not a really simple problem. If I were you I would drop the comparison with ultra-optimized C and concentrate on "does my high-level-good-looking-super-readable implementation perform acceptably?".
What "acceptably" means is left as an exercise to the reader.
Well, so long as it can hash 500 MB of data in a few minutes without using absurd amounts of RAM, that'll do for me. ;-) [I actually wanted to do this for a project at work. When I discovered that none of the available Haskell implementations are fast enough, I tried to write my own. But that wasn't fast enough either. So I ended up being forced to call md5sum.exe and attempt to parse its output. Obviously having a real Haskell function would be infinitely more portable and convenient...]

andrewcoppin:
Salvatore Insalaco wrote:
Hi Andrew, just a profiling suggestion: did you try to use the SCC cost-centre annotations for profiling? If you want to know precisely what takes 60% of time, you can try:

  bn = {-# SCC "IntegerConversion" #-} 4 * fromIntegral wn
  b0 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+0)
  b1 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+1)
  b2 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+2)
  b3 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+3)
  w = foldl' (\w b -> shiftL w 8 .|. fromIntegral b) 0 [b3,b2,b1,b0]
  in {-# SCC "ArrayWriting" #-} unsafeWrite temp wn w
In profiling the time of all expressions with the same SCC "name" will be summed. You can get more information about SCC here: http://www.haskell.org/ghc/docs/latest/html/users_guide/profiling.html#cost-...
OK, I'll give that a try...
One advice: I've seen various md5sum implementations in C, all using about the same algorithms and data structures, and they performed even with 10x time differences between them; md5summing fast is not a really simple problem. If I were you I would drop the comparison with ultra-optimized C and concentrate on "does my high-level-good-looking-super-readable implementation perform acceptably?".
What "acceptably" means is left as an exercise to the reader.
Well, so long as it can hash 500 MB of data in a few minutes without using absurd amounts of RAM, that'll do for me. ;-)
[I actually wanted to do this for a project at work. When I discovered that none of the available Haskell implementations are fast enough,
How hard did you look?

  import System.Environment
  import Data.Digest.OpenSSL.MD5
  import System.IO.Posix.MMap

  main = do
      [f] <- getArgs
      putStrLn . md5sum =<< mmapFile f

Take the md5 of a 600M file:

  $ time ./A /home/dons/tmp/600M
  24a04fdf3f629a42b5baed52ed793a51
  ./A /home/dons/tmp/600M  3.61s user 1.65s system 20% cpu 25.140 total

Easy. -- Don

Andrew, What is "fast enough"? My informal understanding of the implementations is:

1) Unoptimized [Word8] implementations (constant space, two or three orders of magnitude slower than C)
2) Bindings to 'C' routines (typically O(n) space due to strict ByteStrings and no split out of initialContext/md5Update/md5Finalize)
3) Native Lazy.ByteString implementations (perhaps more than one?) ~3-6 times slower than 'C'.

Tom

On Tue, 2008-05-20 at 19:44 +0100, Andrew Coppin wrote:
Salvatore Insalaco wrote:
Hi Andrew, just a profiling suggestion: did you try to use the SCC cost-centre annotations for profiling? If you want to know precisely what takes 60% of time, you can try:

  bn = {-# SCC "IntegerConversion" #-} 4 * fromIntegral wn
  b0 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+0)
  b1 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+1)
  b2 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+2)
  b3 = {-# SCC "ByteStringIndexing" #-} B.index bi (bn+3)
  w = foldl' (\w b -> shiftL w 8 .|. fromIntegral b) 0 [b3,b2,b1,b0]
  in {-# SCC "ArrayWriting" #-} unsafeWrite temp wn w
In profiling the time of all expressions with the same SCC "name" will be summed. You can get more information about SCC here: http://www.haskell.org/ghc/docs/latest/html/users_guide/profiling.html#cost-...
OK, I'll give that a try...
One advice: I've seen various md5sum implementations in C, all using about the same algorithms and data structures, and they performed even with 10x time differences between them; md5summing fast is not a really simple problem. If I were you I would drop the comparison with ultra-optimized C and concentrate on "does my high-level-good-looking-super-readable implementation perform acceptably?".
What "acceptably" means is left as an exercise to the reader.
Well, so long as it can hash 500 MB of data in a few minutes without using absurd amounts of RAM, that'll do for me. ;-)
[I actually wanted to do this for a project at work. When I discovered that none of the available Haskell implementations are fast enough, I tried to write my own. But that wasn't fast enough either. So I ended up being forced to call md5sum.exe and attempt to parse its output. Obviously having a real Haskell function would be infinitely more portable and convenient...]
participants (14)

- Aaron Denney
- Achim Schneider
- Andrew Coppin
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Don Stewart
- Duncan Coutts
- Jules Bean
- Ketil Malde
- Neil Mitchell
- Peter Verswyvelen
- Salvatore Insalaco
- Sebastian Sylvan
- Thomas M. DuBuisson