OCaml list sees abysmal Language Shootout results

I just saw this on the OCaml list (in a posting by "Rafael 'Dido' Sevilla" in the "Observations on OCaml vs. Haskell" thread):
2. Haskell strings are lists of characters
It's annoying that strings aren't normally processed this way in OCaml, and even more annoying that (^) or (::) cannot be used in pattern matching over strings. I like Haskell's approach. The list concatenation operator is the same as the string concatenation operator in Haskell.
This is something of an efficiency/elegance tradeoff. Making strings act like lists means potentially boxing *every character in the string*. In other words, it's potentially a very expensive way of doing business. Paul Graham was mulling over this kind of tradeoff in his design of Arc, as I recall. Another language that does this type of thing is Erlang, and both languages seem to be significantly slower than OCaml in string handling, at least as far as this site goes:
http://shootout.alioth.debian.org/
For the word count benchmark OCaml scores 0.1850 seconds, while GHC is a dismal last place at 105.2110 seconds! Even the bytecode ocaml is an order of magnitude faster. The word frequency benchmark also shows this kind of poor string handling performance for Haskell, with OCaml scoring 0.5669 seconds, while GHC scores a truly dismal 6.4540, more than an order of magnitude slower, and even the bytecode ocaml is faster at 4.2644 seconds.
All in all, it would appear that Haskell's approach has been expensive in terms of performance, if the benchmarks are to be taken at face value. Such are the tradeoffs language designers have to make.

Keith Wansbrough
I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test?
With code like this, I'm not surprised!

main = do file <- getContents
          putStrLn $ show (length $ lines file) ++ " " ++
                     show (length $ words file) ++ " " ++
                     show (length file)

Space-leak or what?

Regards,
Malcolm

At 10:55 28/09/04 +0100, Malcolm Wallace wrote:
Keith Wansbrough
writes: I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test?
With code like this, I'm not surprised!
main = do file <- getContents
          putStrLn $ show (length $ lines file) ++ " " ++
                     show (length $ words file) ++ " " ++
                     show (length file)
Space-leak or what?
Er, please excuse a dumb question, but I'm struggling to see the problem here. I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required. Is that *the* problem to which you allude? I can't see any other problem here. And why would this put Haskell at a disadvantage? #g ------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

On Wed, Sep 29, 2004 at 01:41:03PM +0100, Graham Klyne wrote:
With code like this, I'm not surprised!
main = do file <- getContents
          putStrLn $ show (length $ lines file) ++ " " ++
                     show (length $ words file) ++ " " ++
                     show (length file)
Space-leak or what?
Er, please excuse a dumb question, but I'm struggling to see the problem here.
I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required.
It would be nice if these scans were performed concurrently in a way that would make memory usage constant, wouldn't it? ;) Hmmm... maybe some simple tracking of garbage collection results would suffice... Reschedule if the current thread doesn't help in collecting garbage... But I am dreaming now... :)
Is that *the* problem to which you allude? I can't see any other problem here. And why would this put Haskell at a disadvantage?
The only problem is that some people may draw incorrect conclusions. Should we care? I already submitted two improvements for shootout in the last two days (not included yet), but I don't know if it's worth the effort. I remember SPJ's motto: ,,Avoid success at all cost''. Is this motto still valid? http://research.microsoft.com/Users/simonpj/papers/haskell-retrospective/Has... Best regards, Tom -- .signature: Too many levels of symbolic links

Graham Klyne
main = do file <- getContents
          putStrLn $ show (length $ lines file) ++ " " ++
                     show (length $ words file) ++ " " ++
                     show (length file)
Space-leak or what?
I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required. Is that *the* problem to which you allude? I can't see any other problem here.
Yes, it is the main problem. Don't forget, the shootout benchmark runs this example over a very large input (15Mb). Since the character-list stored in memory for this file takes approximately 12 bytes per character, that blows up to about 180Mb to store temporarily. The shootout performance figures reckon that ghc actually uses 223Mb in total.
And why would this put Haskell at a disadvantage?
Large live heap space means a large time spent in GC, trying to find the needle that is actual garbage in the haystack of live pointers. It also increases the likelihood of cache misses and all kinds of other bad memory effects. In other words, wasting space is wasting time. There is a good literature on heap profiling in Haskell which demonstrates the benefits of keeping space usage small to improve time performance. In any case, for the shootout, this is patently a different algorithm to the one every other solution uses. All the other languages do a simple one-pass traversal of the file, in constant memory space. Why artificially handicap Haskell by insisting it do the job naively? Regards, Malcolm

Malcolm Wallace wrote:
Graham Klyne
writes: I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required. Is that *the* problem to which you allude? I can't see any other problem here.
Yes, it is the main problem. <snip> In any case, for the shootout, this is patently a different algorithm to the one every other solution uses. All the other languages do a simple one-pass traversal of the file, in constant memory space. Why artificially handicap Haskell by insisting it do the job naively?
Just for the heck of it, I'd thought I'd try to write a naive 1-pass version of the program. It turned out to be 4.5 times slower than the original...

-- compiled with: ghc -O2 -ddump-simpl -fvia-c -o wc_fold wc_fold.hs
import IO

main = do file <- getContents
          putStrLn (show (foldl wc (0,0,0) file))

wc :: (Int,Int,Int) -> Char -> (Int, Int, Int)
wc (l,w,c) '\n' = (l+1, w,   c+1)
wc (l,w,c) ' '  = (l,   w+1, c+1)
wc (l,w,c) x    = (l,   w,   c+1)

The algorithm isn't correct (it counts spaces instead of words), but anyone have advice for improving its performance? Is the problem caused by the laziness of the Int's inside the tuple? Here is the report from ghc with the '-ddump-simpl' option.

Main.wc :: (GHC.Base.Int, GHC.Base.Int, GHC.Base.Int)
           -> GHC.Base.Char
           -> (GHC.Base.Int, GHC.Base.Int, GHC.Base.Int)
[GlobalId]
Arity 2 __P Main.$wwc NoCafRefs Str: DmdType U(LLL)U(L)m
Main.wc = __inline_me (\ w :: (GHC.Base.Int, GHC.Base.Int, GHC.Base.Int)
                         w1 :: GHC.Base.Char ->
    case w of w2 { (ww, ww1, ww2) ->
    case w1 of w3 { GHC.Base.C# ww3 ->
    case Main.$wwc ww ww1 ww2 ww3 of ww4 { (# ww5, ww6, ww7 #) ->
    (ww5, ww6, ww7) } } })
Rec {
Main.$wlgo :: GHC.Base.Int -> GHC.Base.Int -> GHC.Base.Int
              -> [GHC.Base.Char]
              -> (# GHC.Base.Int, GHC.Base.Int, GHC.Base.Int #)

And here is the memory allocation report generated from passing '+RTS -M900m -k250m -sstderr' to the executable...

  875,150,472 bytes allocated in the heap
  149,008,464 bytes copied during GC
  250,845,060 bytes maximum residency (2 sample(s))

  478 collections in generation 0 ( 10.44s)
    2 collections in generation 1 (  0.00s)
  813 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.93s  (  1.03s elapsed)
  GC    time   10.44s  ( 11.38s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   11.37s  ( 12.41s elapsed)

  %GC time      91.8%  (91.7% elapsed)
  Alloc rate    941,022,012 bytes per MUT second
  Productivity   8.2% of total user, 7.5% of total elapsed

Finally, here's the profiling report (-pT)...

  Wed Sep 29 15:21 2004 Time and Allocation Profiling Report (Final)
     wc_fold +RTS -M800m -k250m -pT -RTS

  total time  =  1.22 secs (61 ticks @ 20 ms)
  total alloc = 173,502,056 bytes (excludes profiling overheads)

  COST CENTRE  MODULE   %time  %alloc
  wc           Main      60.7    64.8
  main         Main      39.3    35.2

                                              individual      inherited
  COST CENTRE  MODULE            no. entries  %time %alloc  %time %alloc
  MAIN         MAIN                1       0    0.0    0.0  100.0  100.0
   main        Main              146       1   39.3   35.2  100.0  100.0
    wc         Main              147 3048000   60.7   64.8   60.7   64.8
   CAF         GHC.Handle        103       3    0.0    0.0    0.0    0.0
   CAF         System.Posix.Internals 81    4    0.0    0.0    0.0    0.0

Any hints appreciated,

Greg Buchholz

Greg Buchholz wrote:
The algorithm isn't correct (it counts spaces instead of words), but anyone have advice for improving its performance?
You probably want some strictness annotations in there. . . When I tried the same thing, I came up with something like:
import Char

cclass c | isSpace c = (c == '\n', False)
         | otherwise = (False    , True)

data Cdata = Cdata !Bool !Int !Int !Int deriving Show

combine (Cdata last l w c) (nl, iw) = Cdata iw l' w' (c + 1)
  where l' = if nl then l + 1 else l
        w' = if not last && iw then w + 1 else w

wc = foldl combine (Cdata False 0 0 0) . map cclass

main = getContents >>= putStrLn . show . wc
It seems to work in constant stack space, and gives the same answers (albeit not very beautifully) as my GNU copy of "wc".
Is the problem caused by the laziness of the Int's inside the tuple?
I'm pretty sure that's what's causing it. I had quite a search around when my version was running out of memory and everything seemed to suggest that, basically, the algorithm is building a massive list of "+1"s that only actually get executed when you try to print the totals at the end. Any comments from more authoritative sources?
Here is the report from ghc with the '-ddump-simpl' option.
If anyone has any hints about how to read this output, I'd be grateful. It makes a bit of sense, but I don't really know what it "means". I.e. it's obviously the simplified parse tree and I can see how it relates back to the source (loosely), but attempting to figure out if something's going to be as leaky as a sieve or not is beyond me.

Hi folks,
On Thu, 30 Sep 2004 01:02:54 +0100, Sam Mason
Greg Buchholz wrote:
The algorithm isn't correct (it counts spaces instead of words), but anyone have advice for improving its performance?
You probably want some strictness annotations in there. . . <snip>
Last night I tried to improve Greg's wc in a simple fashion and came up with the same idea of making a new data type with strict fields. I wondered why one couldn't add some kind of strictness annotation to the function type. First attempt:

wc :: !(Int,Int,Int) -> Char -> (Int, Int, Int)

As far as I know the compiler does strictness analysis to find strict arguments in a function anyway. Would it make sense to allow this kind of explicit strictness? Where are the problems with that? I mean laziness is really useful and it is our best friend in this kind of application, since we can do stream processing without implementing a buffer and so on. On the other hand one gets occasionally trapped by it and it is not always easy to grasp why.

Some more general comments: the code for the shootout doesn't need to be extremely fast in my eyes; it needs to be elegant and reasonable in performance and memory consumption (in this order). I don't want to say that Tomasz's solution is bad, but it is not a typical Haskell-style application. If someone (not a Haskeller) looks at the implementation it should be very obvious and clear.

The last few weeks the list has been full of performance questions (including my own) and it is really a pity that it is such an issue. I believe that most problems with performance and memory consumption could be easily solved by partial and explicit strictness. Please enlighten me if I am wrong.

Regards,
Georg

Georg Martius wrote:
Some more general comments: the code for the shootout doesn't need to be extremely fast in my eyes; it needs to be elegant and reasonable in performance and memory consumption (in this order). I don't want to say that Tomasz's solution is bad, but it is not a typical Haskell-style application. If someone (not a Haskeller) looks at the implementation it should be very obvious and clear.
It might also be nice if the code would run under the other Haskell compilers like Hugs and nhc98 right out of the box.

Greg Buchholz

"Georg Martius"
Last night I tried to improve Greg's wc in a simple fashion and came up with the same idea of making a new data type with strict fields. I wondered why one couldn't add some kind of strictness annotation to the function type. First attempt:
wc :: !(Int,Int,Int) -> Char -> (Int, Int, Int)
I'm not sure if that was your question, but I think this will just ensure that the first argument is really a tuple (and not bottom), it doesn't evaluate the Ints inside the tuple strictly. Try something like

wc :: !Int -> !Int -> !Int -> Char -> (Int,Int,Int)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde
wc :: !(Int,Int,Int) -> Char -> (Int, Int, Int)
I'm not sure if that was your question
Sorry about that, brain malfunction, bangs are for data declarations, I'll get that cup of coffee now. I guess what you really want to do, is to put some `seq`s in there. Something like:

wc (cs,ws,ls) ... = cs `seq` ws `seq` ls `seq` ...the def. of wc...

which evaluates the Ints before doing anything else. Or use (!$) (like function application ($), but strict).

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
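To make that suggestion concrete, here is a minimal sketch (my own code, not something posted in the thread) of Greg's fold with the counters forced at every step; it keeps his counting logic, including the incorrect word count, and only addresses the space behaviour:

import Data.List (foldl')

-- Force the incoming counters with `seq` before building the next tuple,
-- so no chain of unevaluated (+1) thunks accumulates across the fold.
wc :: (Int, Int, Int) -> Char -> (Int, Int, Int)
wc (l, w, c) x =
  l `seq` w `seq` c `seq`
  case x of
    '\n' -> (l + 1, w,     c + 1)
    ' '  -> (l,     w + 1, c + 1)
    _    -> (l,     w,     c + 1)

main :: IO ()
main = do
  file <- getContents
  -- foldl' keeps the accumulator evaluated as the input is consumed.
  print (foldl' wc (0, 0, 0) file)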

On Wed, 29 Sep 2004 15:58:24 -0700, Greg Buchholz
Just for the heck of it, I'd thought I'd try to write a naive 1-pass version of the program. It turned out to be 4.5 times slower than the original...
-- compiled with: ghc -O2 -ddump-simpl -fvia-c -o wc_fold wc_fold.hs
import IO
main = do file <- getContents
          putStrLn (show (foldl wc (0,0,0) file))

wc :: (Int,Int,Int) -> Char -> (Int, Int, Int)
wc (l,w,c) '\n' = (l+1, w,   c+1)
wc (l,w,c) ' '  = (l,   w+1, c+1)
wc (l,w,c) x    = (l,   w,   c+1)
Using strictness is the trick to exploit the tail recursion. There is possibly a better way, but:

data C = C !Int !Int !Int deriving Show

wc' :: C -> Char -> C
wc' (C l w c) '\n' = C (l+1) w     (c+1)
wc' (C l w c) ' '  = C l     (w+1) (c+1)
wc' (C l w c) x    = C l     w     (c+1)

is significantly faster. The results on a 12MB file:

original:

  1,699,541,396 bytes allocated in the heap
    340,796,888 bytes copied during GC
     75,415,872 bytes maximum residency (8 sample(s))
            146 Mb total memory in use
  Total time    5.92s  (  6.09s elapsed)

wc: (yours) I have not enough memory :-(

wc':

    535,286,872 bytes allocated in the heap
    187,340,032 bytes copied during GC
        135,696 bytes maximum residency (130 sample(s))
              2 Mb total memory in use
  Total time    2.45s  (  2.53s elapsed)

Cheers,
Georg
--
---- Georg Martius, Tel: (+49 34297) 89434 ----
------- http://www.flexman.homeip.net ---------

Just out of interest, I ran all of these suggested variations of the word count solution in Haskell head-to-head against each other. Here are the results, in seconds, on my machine (2.4GHz x86/Linux) for the suggested input (N=500) from the shootout site. All Haskell versions were compiled with ghc-5.04.2 -O2.

  original space-leaky   2.257
  Greg Buchholz          1.619 *
  Sam Mason              0.594
  Malcolm Wallace        0.457
  Georg Martius          0.322 *
  Tomasz Zielonka        0.047
  linux 'wc'             0.085

Those marked with a * gave the wrong number of words. The really interesting thing is that Tomasz's solution is twice as fast as the standard Gnu implementation!

Regards,
Malcolm

On Thu, Sep 30, 2004 at 11:26:15AM +0100, Malcolm Wallace wrote:
Those marked with a * gave the wrong number of words. The really interesting thing is that Tomasz's solution is twice as fast as the standard Gnu implementation!
That's probably because Gnu wc is locale aware. Best regards, Tom -- .signature: Too many levels of symbolic links

On Thu, Sep 30, 2004 at 11:26:15AM +0100, Malcolm Wallace wrote:
Just out of interest, I ran all of these suggested variations of the word count solution in Haskell head-to-head against each other. Here are the results, in seconds, on my machine (2.4GHz x86/Linux) for the suggested input (N=500) from the shootout site. All Haskell versions were compiled with ghc-5.04.2 -O2.
  original space-leaky   2.257
  Greg Buchholz          1.619 *
  Sam Mason              0.594
  Malcolm Wallace        0.457
  Georg Martius          0.322 *
  Tomasz Zielonka        0.047
  linux 'wc'             0.085
Those marked with a * gave the wrong number of words. The really interesting thing is that Tomasz's solution is twice as fast as the standard Gnu implementation!
I took Georg's, fixed the word count logic and added prettier printing, and then combined it with Sam's main (which I find more elegant, but others may find less straightforward). I think it strikes a good balance between efficiency and elegance.

Cheers,

Kevin.

------

import IO

main = getContents >>= putStrLn . showC . foldl wc' (C 0 0 0 False)

data C = C !Int !Int !Int !Bool deriving Show -- Line Word Char InWord

showC (C l w c _) = show l ++ " " ++ show w ++ " " ++ show c

wc' :: C -> Char -> C
wc' (C l w c _)     '\n' = C (l+1) w     (c+1) False
wc' (C l w c _)     ' '  = C l     w     (c+1) False
wc' (C l w c _)     '\t' = C l     w     (c+1) False
wc' (C l w c False) _    = C l     (w+1) (c+1) True
wc' (C l w c True)  _    = C l     w     (c+1) True

On Thu, Sep 30, 2004 at 09:49:46AM -0400, Kevin Everets wrote:
I took Georg's, fixed the word count logic and added prettier printing, and then combined it with Sam's main (which I find more elegant, but others may find less straightforward). I think it strikes a good balance between efficiency and elegance.
Then how about a solution like this: I took your program but used my fast fileIterate instead of ,,foldl over getContents''. I also added {-# OPTIONS -funbox-strict-fields #-}, and played a bit to get the best optimisations from GHC. It's about 7 times faster this way, but it's still two times slower than the solution I sent to shootout. Devilish plan: Maybe we could have some variants of fileIterate in GHC's libraries? ;-> I remember that someone proposed similar functions on haskell's lists some time ago, but can't remember who. Best regards, Tom -- .signature: Too many levels of symbolic links
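Tomasz's fileIterate itself isn't reproduced in the thread; purely as a rough sketch of what such a library function might look like (the name, signature, buffer size and Latin-1 byte decoding below are my assumptions, not his code), a buffer-based left fold over a Handle could be written along these lines:

import System.IO (Handle, hGetBuf)
import Foreign (Ptr, Word8, allocaBytes, peekElemOff)
import Data.Char (chr)

-- Hypothetical fileIterate: fold a strict function over the characters of a
-- Handle, reading fixed-size chunks with hGetBuf instead of going through a
-- lazy String.
fileIterate :: Handle -> (a -> Char -> a) -> a -> IO a
fileIterate h f z0 = allocaBytes bufSize $ \buf -> chunk buf z0
  where
    bufSize = 4096
    chunk buf z = do
      n <- hGetBuf h buf bufSize
      if n == 0
        then return z
        else bytes buf 0 n z >>= chunk buf
    bytes buf i n z
      | i == n    = return z
      | otherwise = do
          w <- peekElemOff buf i :: IO Word8
          let z' = f z (chr (fromIntegral w))   -- byte treated as Latin-1
          z' `seq` bytes buf (i + 1) n z'

A word counter would then be, for instance, fileIterate stdin wc' (C 0 0 0 False) with one of the strict counting functions posted elsewhere in the thread.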

At 16:56 30/09/04 +0200, Tomasz Zielonka wrote:
Then how about a solution like this: I took your program but used my fast fileIterate instead of ,,foldl over getContents''. I also added {-# OPTIONS -funbox-strict-fields #-}, and played a bit to get the best optimisations from GHC.
It's about 7 times faster this way, but it's still two times slower than the solution I sent to shootout.
Devilish plan: Maybe we could have some variants of fileIterate in GHC's libraries? ;->
Two responses:

1. I agree that providing the right kind of library functions (and material explaining how to use them) may be a key to getting efficient code without losing high-level forms of expression.

2. Your fileIterate certainly looks nicer (to me) than your other solution, but...

Tagging along with this debate, I found myself wondering if, in order to get performance comparable to other languages, it is really necessary to write code like code in other languages. E.g., I thought one of the lessons of John Hughes' "Why Functional Programming Matters" was that one can achieve greater efficiencies by climbing higher rather than dropping down to the level of other languages. Your fileIterate looks to me like a step in the right direction.

But I did wonder if it wouldn't be possible to also abstract out the I/O element of your 'fileIterate', using instead something like:

streamIterate :: [b] -> (a -> b -> a) -> a -> IO a

(I was originally thinking of something like:

streamIterate :: (c -> (b,c)) -> c -> (a -> b -> a) -> a -> IO a

where the first argument is a function that takes a sequence generator and returns the next member of the sequence plus a new generator, and the 2nd arg is the initial generator.)

For such an approach to be useful, I think it would also be important to have variations of functions like length, lines, and words that can be combined to make a function like your wc'.

#g
------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact

On Thu, Sep 30, 2004 at 05:40:58PM +0100, Graham Klyne wrote:
2. Your fileIterator certainly looks nicer (to me) than your other solution, but...
It looks nicer to me too.
Tagging along with this debate, I found myself wondering if, in order to get performance comparable to other languages, it is really necessary to write code like code in other languages.
Maybe it's not necessary to get the comparable performance, but it often is if you want to get the best performance possible. Of course I wouldn't mind if GHC optimised high level code so well, that we wouldn't have to do it, but I guess it's just not that easy. What I like about GHC is that I can start from simple, high-level, sometimes slow solutions, but if there are efficiency problems, there is a big chance that I can solve them without switching the language.
But I did wonder if it wouldn't be possible to also abstract out the I/O element of your 'fileIterate', using instead something like:

streamIterate :: [b] -> (a -> b -> a) -> a -> IO a
It seems to be a variant of foldl. You can eliminate IO from return type, or is there some reason for it? Best regards, Tom -- .signature: Too many levels of symbolic links

At 19:39 30/09/04 +0200, Tomasz Zielonka wrote:
What I like about GHC is that I can start from simple, high-level, sometimes slow solutions, but if there are efficiency problems, there is a big chance that I can solve them without switching the language.
That's a very good point, I think. One to hang on to.
But I did wonder if it wouldn't be possible to also abstract out the I/O element of your 'fileIterate', using instead something like:

streamIterate :: [b] -> (a -> b -> a) -> a -> IO a
It seems to be a variant of foldl. You can eliminate IO from return type, or is there some reason for it?
Doh! (Why didn't I spot that?) All roads lead to Rome, or something like that?

There seems to be a recurring tension between how much to specialize and how much to generalize. Maybe it should be something like:

streamIterate :: (Monad m) => [b] -> (a -> b -> m a) -> a -> m a

? Er, but that's similarly a variation of foldM, right? Or maybe my earlier idea was closer:

streamIterate :: (Monad m) => (c -> m (b,c)) -> c -> (a -> b -> m a) -> a -> m a

? Hmmm... I feel like an (intellectual) bull-in-a-china-shop here. I'm blundering about on the trail of a delicate and elegant idea that I'm sure others could dissect far more clearly. What I'm trying to capture (I think) is that there's some baggage to do with accessing the raw data and emitting the desired result that needs to be carried along (or interleaved) with the required computation on that data. Does this make any sense, or am I descending into farce here?

#g
------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact
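For what it's worth, Graham's second signature can be implemented directly once the generator is given a way to signal end of input; the Maybe below is my own addition and the whole thing is just a sketch of the idea, not a worked-out library design:

-- A monadic "unfold the input, fold the accumulator" combinator.  The
-- generator step returns the next element plus a new generator state, or
-- Nothing at end of input.  (The Maybe is an assumption of mine; Graham's
-- signature did not include it.)
streamIterate :: Monad m
              => (c -> m (Maybe (b, c)))   -- generator step
              -> c                         -- initial generator state
              -> (a -> b -> m a)           -- accumulator step
              -> a                         -- initial accumulator
              -> m a
streamIterate next s0 f z0 = go s0 z0
  where
    go s z = do
      r <- next s
      case r of
        Nothing      -> return z
        Just (b, s') -> f z b >>= \z' -> z' `seq` go s' z'

With next instantiated to an hGetBuf-style reader this recovers something very close to fileIterate; with next pattern-matching on a list it essentially collapses to foldM, which is exactly the tension Graham describes.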

Malcolm Wallace wrote:
Here are the results, in seconds, on my machine (2.4GHz x86/Linux) for the suggested input (N=500) from the shootout site. All Haskell versions were compiled with ghc-5.04.2 -O2.
I thought I'd take a stab at timing a few of the examples with different compiler versions to see what difference that would make (ghc-6.2.1 vs. ghc-6.3.20040928). I compared Kevin Everets' version with the two Tomasz Zielonka versions. I ran the test with N=2500 (i.e. 2500 copies of the original file, which is what is apparently used in the shootout) on my AthlonXP 1600 under x86/Linux.

                      6.2.1     6.3.20040928
                      -------   -------
  Kevin               3.615s    3.156s
  Kevin (+RTS -G1)    1.666s    1.405s
  Tomasz (pretty)     0.725s    0.481s
  Tomasz (fast)       0.403s    0.430s

Interesting to see the speed increase going from 6.2.1 to 6.3 for Tomasz's pretty example. Anyone have an explanation for the 2x speed increase for running Kevin's version with '+RTS -G1'?

(And for reference, here are the results on my machine for the perl and gcc versions of the test and gnu/wc.)

  perl-5.8.4    0.535s
  gcc-3.4.2     0.102s
  gnu/wc        0.435s

Greg Buchholz

Hi Greg
Anyone have an explanation for the 2x speed increase for running Kevin's version with '+RTS -G1'?
+RTS -Sstderr -RTS and +RTS -sstderr -RTS will probably indicate why. I'd be surprised if the amount of data copied for the semi-space collector isn't much less than for the generational.

Chances are that data is dying off very quickly and very little is being copied for the semi-space collector - the allocation area has a variable sizing policy, and this size of allocation area is sufficient for the majority of the data to die off before it is filled and a GC kicks in - hence very little is copied. However, for the generational collector the nursery is of fixed size. Here, the lifetime of the data is probably longer than the time it takes for the nursery to be filled, so a minor GC kicks in and the data is promoted to generation 1 (hence copied), where it then probably dies off but can't be collected until a major collection kicks in.

Probably, or something like that ;-)

Cheers

Andy

*********************************************************************
*  Andrew Cheadle                  email:  a.cheadle@doc.ic.ac.uk   *
*  Department of Computing         http://www.doc.ic.ac.uk/~amc4/   *
*  Imperial College                                                 *
*  University of London                                             *
*********************************************************************

Andrew Cheadle wrote:
+RTS -Sstderr -RTS and +RTS -sstderr -RTS will probably indicate why. I'd be surprised if the amount of data copied for the semi-space collector isn't much less than for the generational.
Ahh. Data copied with '-G1' = 58MB vs. 203MB without. For posterity's sake, here are the numbers...

With '-G1' ---------------------------------------------------

    306,616,872 bytes allocated in the heap
     58,844,344 bytes copied during GC
         99,316 bytes maximum residency (1169 sample(s))

  1169 collections in generation 0 (  0.62s)
     1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.68s  (  0.71s elapsed)
  GC    time    0.62s  (  0.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.30s  (  1.39s elapsed)

  %GC time      47.7%  (48.9% elapsed)
  Alloc rate    450,907,164 bytes per MUT second
  Productivity  52.3% of total user, 48.9% of total elapsed

Without --------------------------------------------------

    306,616,872 bytes allocated in the heap
    203,339,812 bytes copied during GC
        109,088 bytes maximum residency (131 sample(s))

  1169 collections in generation 0 (  2.22s)
   131 collections in generation 1 (  0.05s)
     2 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.79s  (  0.92s elapsed)
  GC    time    2.27s  (  2.23s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.06s  (  3.15s elapsed)

  %GC time      74.2%  (70.8% elapsed)
  Alloc rate    388,122,622 bytes per MUT second
  Productivity  25.8% of total user, 25.1% of total elapsed

Greg Buchholz

How can anyone stay away from such a deliciously pointless waste of time as implementing a wc(1) derivative? :-) Here is my attempt:
import IO
type Count = Int

data CountingState = ST !Bool !Count !Count !Count
                   deriving (Show)
initCST = ST True 0 0 0
wc :: CountingState -> [Char] -> CountingState
wc (ST _     l w c) ('\n':xs) = wc (ST True  (l+1) w     (c+1)) xs
wc (ST _     l w c) (' ' :xs) = wc (ST True  l     w     (c+1)) xs
wc (ST _     l w c) ('\t':xs) = wc (ST True  l     w     (c+1)) xs
wc (ST True  l w c) ( x  :xs) = wc (ST False l     (w+1) (c+1)) xs
wc (ST False l w c) ( x  :xs) = wc (ST False l     w     (c+1)) xs
wc st [] = st
main :: IO ()
main = do ST _ l w c <- getContents >>= return . wc initCST
          putStrLn $ (l `shows`) . spaces . (w `shows`) . spaces . (c `shows`) $ []
  where spaces = (' ':) . (' ':) . (' ':)
I compiled this with "ghc -O2 -funbox-strict-fields" and got the following performance results in a simple test. The wc(1) tool:

$ time /usr/bin/wc
import IO
import Control.Monad.State
type Count = Int

data CountingState = ST !Bool !Count !Count !Count
                   deriving (Show)
type WordCounter = State CountingState ()
initCST = ST True 0 0 0
wc :: Char -> WordCounter
wc x = get >>= \(ST b l w c) ->
       case (b,x) of
         ( _  , '\n') -> put (ST True  (l+1) w     (c+1))
         ( _  , '\t') -> put (ST True  l     w     (c+1))
         ( _  , ' ' ) -> put (ST True  l     w     (c+1))
         (True,  _  ) -> put (ST False l     (w+1) (c+1))
         (False, _  ) -> put (ST False l     w     (c+1))
main :: IO ()
main = do xs <- getContents
          let ST _ l w c = snd (runState (mapM wc xs) initCST)
          putStrLn $ (l `shows`) . spaces . (w `shows`) . spaces . (c `shows`) $ []
  where spaces = (' ':) . (' ':) . (' ':)
Curiously enough, this version fails to process the "words" file because it runs out of stack space! Naturally, it is very slow, too. So I wonder: how does the program above need to be changed in order to solve this space leak? Why does this happen in the first place?

Peter

On 29/09/2004, at 8:41 AM, Graham Klyne wrote:
I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required. Is that *the* problem to which you allude? I can't see any other problem here. And why would this put Haskell at a disadvantage?
I've been watching this thread with interest, and posted my own thoughts on this thread and Haskell's performance in general as a blog entry. Rather than repeat it all here, I'll post a link to it:

http://www.algorithm.com.au/mt/haskell/haskells_performance.html

The executive summary of my thoughts is that it seems to be entirely possible to optimise Haskell to be competitive with other, more performance-focused languages, but it's hard, and you have to be a Haskell expert to do so. One possible solution may be to allow for some extra, syntactically integrated declarations to be inserted by the programmer which enable much better optimisation (e.g. see the unboxed strict array example in Clean: much clearer and less work than using IOUArrays). Performance is the one major reason I recommend many existing C programmers try out O'Caml rather than Haskell as their first functional programming language, and it would be really nice if optimisation were made a bit easier.

--
% Andre Pang : trust.in.love.to.save

On Wed, Oct 06, 2004 at 01:23:56PM -0400, Andre Pang wrote:
I've been watching this thread with interest, and posted my own thoughts on this thread and Haskell's performance in general as a blog entry. Rather than repeat it all here, I'll post a link to it:
http://www.algorithm.com.au/mt/haskell/haskells_performance.html
I feel a bit guilty for my ugly wc implementation. At the moment of writing the first version I was thinking only about efficiency, not about elegance. I shouldn't have used unsafeRead, because it doesn't give such a big advantage here if you take the danger into account. Secondly, my solution fails to separate file iteration from the algorithm for the problem.

We have already created on this list a version which is fast and quite elegant at the same time, and I feel this one is better for the shootout even if it's slower than the one currently used (but it doesn't use unsafeRead). The good news is that the development GHC 6.3 compiles this to code which is almost as fast as the ugly one.

Maybe we should vote: which wc implementation should go to the shootout?

Best regards,
Tom
--
.signature: Too many levels of symbolic links

I feel a bit guilty for my ugly wc implementation. At the moment of writing the first version I was thinking only about efficiency, not about elegance. [..] We have already created on this list a version which is fast and quite elegant at the same time, and I feel this one is better for the shootout even if it's slower than the one currently used (but it doesn't use unsafeRead). The good news is that the development GHC 6.3 compiles this to code which is almost as fast as the ugly one.
I agree; the code should look reasonable - this will be many people's first sight of Haskell code.
Count me as a vote for the better-but-slightly-slower wc.
--KW 8-)
--
Keith Wansbrough

Keith Wansbrough writes:
Count me as a vote for the better-but-slightly-slower wc.
How about the attached program? On my machine it is faster than Tomasz's version, and I think it's still fairly clean source code. Using some random large file for input, I got these results with time(1):

  real    0m33.883s     -- getarray/unsafeRead
  user    0m22.594s
  sys     0m2.493s

  real    0m30.435s     -- hgetbuf/peek
  user    0m13.958s
  sys     0m2.814s

Peter

module Main ( main ) where

import System.IO
import Foreign

bufsize :: Int                  -- our I/O buffer size
bufsize = 4096

type Count = Int32

data CountingState = ST !Bool !Count !Count !Count
                   deriving (Show)

initCST :: CountingState
initCST = ST True 0 0 0

wc :: Char -> CountingState -> CountingState
wc '\n' (ST _     l w c) = (ST True  (l+1) w     (c+1))
wc ' '  (ST _     l w c) = (ST True  l     w     (c+1))
wc '\t' (ST _     l w c) = (ST True  l     w     (c+1))
wc  _   (ST True  l w c) = (ST False l     (w+1) (c+1))
wc  _   (ST False l w c) = (ST False l     w     (c+1))

countBuf :: Ptr Word8 -> Int -> CountingState -> IO CountingState
countBuf _   0 st@(ST _ _ _ _) = return st
countBuf ptr n st@(ST _ _ _ _) = do
    c <- fmap (toEnum . fromEnum) (peek ptr)
    countBuf (ptr `plusPtr` 1) (n - 1) (wc c st)

loop :: Handle -> Ptr Word8 -> CountingState -> IO CountingState
loop h ptr st@(ST _ _ _ _) = do
    rc <- hGetBuf h ptr bufsize
    if rc == 0
       then return st
       else countBuf ptr rc st >>= (loop h ptr $!)

main :: IO ()
main = do
    allocaArray bufsize $ \ptr -> do
        ST _ l w c <- loop stdin ptr initCST
        putStrLn . shows l . (' ':) . shows w . (' ':) . shows c $ ""

Peter Simons
Keith Wansbrough writes:
Count me as a vote for the better-but-slightly-slower wc.
How about the attached program? On my machine it is faster than Tomasz's version, and I think it's still fairly clean source code.
I guess it's possible to submit three different Haskell entries (for GHC, Hugs and NHC). Perhaps we should split up, and aim for performance with GHC, memory with NHC and clarity/brevity with Hugs? (Seriously, it would be very instructive to look at the differences between the three versions.) -kzm -- If I haven't seen further, it is by standing in the footprints of giants

Here is a quick fix for the Random Numbers category. Looks like a classic space-leak, again on the arithmetic operations. Speeds up results for me about 6X and brings memory usage down out of the stratosphere.

w/o strictness:

  ~/shootout$ time random 900000 +RTS -K32000000
  75.544410151

  real    0m0.756s
  user    0m0.670s
  sys     0m0.050s

w/ strictness:

  ~/shootout$ time random 900000     # <-- don't need massive heap
  75.544410151

  real    0m0.139s
  user    0m0.100s
  sys     0m0.010s

--
Robert
robdockins@fastmail.fm

Andre Pang writes (in the Haskell cafe):
The executive summary of my thoughts is that it seems to be entirely possible to optimise Haskell to be competitive with other, more performance-focused languages, but it's hard...
and from his blog:
if you need speed, you can get it in Clean much more easily in than Haskell.
Perhaps Clean gives you more fine-grained control over the memory and execution behaviour of your program, but this is normally at the expense of the clarity and elegance of your program. This is where I think we (FP-ers) stand: we can write elegant programs and we can write efficient programs, but combining these two aspects is still difficult. The Haskell and Clean entries in the Language Shootout clearly demonstrate this.

Here are some ideas to better combine elegance and efficiency:

- Decouple data and functions. It's often easy to make choices in the data structures with an eye to efficiency; for example, sometimes it's better to unbox elements (faster access), sometimes it's better to share the elements (use less memory). The programmer should be able to make these decisions without changing the functions that use the data structure. Overloading can help, but perhaps a good implementation of views will improve it further.

- More compiler optimisations. Preferably the optimisations should be reasonably predictable for the programmer.

- Programmer-guided transformations and optimisations. I don't think we'll ever reach the state where compiler optimisations can completely bridge the gap between elegance and efficiency. I don't mind giving it some help, but I'd like to keep my original program. For example, tell the compiler to unfold a function once, so that the function is strict in more arguments in the recursion.
[In Clean] you can, for example, make an unboxed, strict array just by writing [# 1, 2, 3 !] rather than [1, 2, 3]
I call [# 1, 2, 3 !] a tail-strict list with unboxed elements (it doesn't have constant-time access).

Cheers,
Ronny Wichers Schreur
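For comparison, the nearest off-the-shelf Haskell analogue of Clean's unboxed strict list is probably an unboxed array; a minimal sketch (my own example, not code from the thread):

import Data.Array.Unboxed (UArray, listArray, (!))

-- An unboxed array of Ints: the elements are stored flat and fully evaluated,
-- unlike [Int].  Unlike Clean's [# 1, 2, 3 !] it is an array rather than a
-- list, so it has constant-time indexing but no cheap cons.
xs :: UArray Int Int
xs = listArray (0, 2) [1, 2, 3]

main :: IO ()
main = print (xs ! 1)   -- prints 2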

Andre,

I very much enjoyed reading your blog entry. I would like to make a few comments. First of all I heartily agree with what you call "the main problem". I quote: "The main problem I see with all this is that it’s just too hard for an average Haskell programmer to get good performance out of Haskell". This is just so true and something which we need to do something about. What I would like to comment on is why we have this problem and what we can/should do about it.

Why is it difficult to write good performance Haskell programs? Well, I'm not going to try and discuss this general question but focus on I/O since that is what this thread is really about. So let's reformulate the question: why is it difficult to write good performance I/O in Haskell?

From your blog entry I take it that you think that there is something wrong with the language Haskell. Maybe I misinterpret you but this is how I read it. I don't quite agree with this. What *is* true is that it is difficult to write good performance I/O in any of the *implementations* that we have. And this is of course a shame. I think this is partly because fast I/O hasn't been that high a priority. I recall an email dating a few years back where Simon PJ was saying that they haven't spent that much time on making I/O blazingly fast. So perhaps bringing the issue to the table and making the implementors aware of it will improve the situation.

Although I do not believe that the Haskell language itself is to blame for why it's difficult to write decent performing I/O, the standard libraries might be a problem. It may be that the functions they provide are difficult to make efficient and that they encourage abuse. One particular function I have in mind is getContents. Maybe we need another set of functions which, while still being high-level, are much easier to implement efficiently. Work in this direction has already started with the BlockIO library. I think this is an exciting and promising route to take.

Andre, you suggest adding syntax to help the programmer write faster code. Somehow I don't think that is the right way to go, even if it is only my gut feeling. I think this problem can and should be solved at the library level rather than at the language design level. But I might of course be wrong.

Ok, enough preaching for today. For the record, I also recommend O'Caml rather than Haskell to my performance-aware friends.

/Josef

Andre Pang wrote:
On 29/09/2004, at 8:41 AM, Graham Klyne wrote:
I can see that this requires the original file to be kept for 3-time scanning, so enough memory for the entire file will be required. Is that *the* problem to which you allude? I can't see any other problem here. And why would this put Haskell at a disadvantage?
I've been watching this thread with interest, and posted my own thoughts on this thread and Haskell's performance in general as a blog entry. Rather than repeat it all here, I'll post a link to it:
http://www.algorithm.com.au/mt/haskell/haskells_performance.html
The executive summary of my thoughts is that it seems to be entirely possible to optimise Haskell to be competitive with other, more performance-focused languages, but it's hard, and you have to be a Haskell expert to do so. One possible solution may be to allow for some extra, syntactically integrated declarations to be inserted by the programmer which enable much better optimisation (e.g. see the unboxed strict array example in Clean: much clearer and less work than using IOUArrays). Performance is the one major reason I recommend many existing C programmers try out O'Caml rather than Haskell as their first functional programming language, and it would be really nice if optimisation were made a bit easier.

Josef Svenningsson writes:
What *is* true is that it is difficult to write good performance I/O in any of the *implementations* that we have.
GHC has everything you need to do fast I/O in Haskell. If you use hGetBufNonBlocking with 'Ptr a', you have essentially the performance of read(2). That's as fast as it can be on any given system, in any language. Even the lazy API is fast enough for pretty much everything bar the most I/O-intensive applications.

The problem is not Haskell, nor is it the implementation. The problem is that beginners, including yours truly, tend to write awfully inefficient code once you give them a "String" and tell them: Here, that's the contents of your file. The lazy API hides the fact that I/O takes place, and it is very tempting to forget that you shouldn't use it like a random access list. The lazy API "encourages" you to write inefficient code. Nonetheless, the problem is the inefficient code, not the language.

There has been talk about alternative I/O APIs on the list recently. It would be nice to see something happening on that front.

Peter

Peter Simons
The problem is not Haskell, nor is it the implementation. The problem is that beginners, including yours truly, tend to write awfully inefficient code once you give them a "String" and tell them: Here, that's the contents of your file.
And it's just so *convenient* to use it.
Nonetheless, the problem is the inefficient code, not the language.
But even a simple, single-pass word count using the standard IO is slow; it would be really nice to write this in a straightforward way and still get decent performance. As somebody just said, you get to choose between speed and simplicity/clarity of code. I would like both. Couldn't readFile et al. provide the standard interface, but use hGetBuf tricks (e.g. from your 'wc' entry) behind the curtains?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde (ketil+haskell@ii.uib.no) wrote:
As somebody just said, you get to choose between speed and simplicity/clarity of code. I would like both.
Me too. Simplicity/clarity of code is, imho, one of the strong points of using Haskell.
Couldn't readFile et al. provide the standard interface, but use hGetBuf tricks (e.g. from your 'wc' entry) behind the curtains?
That would be nice indeed. It's a pity that Ocaml is getting recommended for a 'real-life' applications instead of Haskell :-( Sincerely, Gour

Ketil Malde writes:
Couldn't readFile et al. provide the standard interface, but use hGetBuf tricks (e.g. from your 'wc' entry) behind the curtains?
No amount of hGetBuf'ing will speed the program up if the problem is the algorithm. I/O comes _sequentially_, and every program that doesn't process the input sequentially will require some form of buffering. Lazy I/O provides a great general-purpose buffering mechanism, but it simply can't be as fast or as efficient as hand-written code.

IMHO, a good I/O API forces you to write a stateful callback function (that's how I implemented it in BlockIO), so _if_ you want to buffer something, you can use your state for that. But the I/O driver doesn't buffer; what has been read and processed is gone. Then the I/O part will _always_ be as fast as possible, and the design encourages you to process the data sequentially without looking forward/backward too much. It simply means writing a

Char -> StateT st IO ()     -- stream processor

or

String -> StateT st IO ()   -- line oriented

function to handle a _part_ of the input every time, rather than a

String -> IO ()

function that handles the entire input at once.

Peter
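As a rough illustration of the callback style described above (this is not the actual BlockIO interface, and the names are made up), a driver could look like the following; a real block-oriented driver would read whole buffers with hGetBuf and feed the callback from each block, but the shape of the interface is the point here:

import System.IO (Handle, hIsEOF, hGetChar, stdin)
import Control.Monad.State (StateT, execStateT, get, put)
import Control.Monad.Trans (lift)

-- Hypothetical driver: hand the callback one Char at a time.  Any buffering
-- the callback needs lives in its own state `st`; the driver itself keeps
-- nothing once a character has been processed.
runDriver :: Handle -> (Char -> StateT st IO ()) -> st -> IO st
runDriver h callback = execStateT go
  where
    go = do
      eof <- lift (hIsEOF h)
      if eof
        then return ()
        else lift (hGetChar h) >>= callback >> go

-- Example callback: count characters, forcing the count at each step.
main :: IO ()
main = runDriver stdin (\_ -> get >>= \n -> put $! n + (1 :: Int)) 0 >>= print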

On 07/10/2004, at 11:51 AM, Peter Simons wrote:
The problem is not Haskell, nor is it the implementation. The problem is that beginners, including yours truly, tend to write awfully inefficient code once you give them a "String" and tell them: Here, that's the contents of your file. The lazy API hides the fact that I/O takes place, and it is very tempting to forget that you shouldn't use it like a random access list. The lazy API "encourages" you to write inefficient code.
Nonetheless, the problem is the inefficient code, not the language.
If the language encourages you to write inefficient code (or rather, discourages you from writing efficient code), I see that as a problem with the language. You mention the use of hGetBufNonBlocking to get better performance: would you like to rewrite all your code to use hGetBufNonBlocking rather than using e.g. stream-based lazy lists? -- % Andre Pang : trust.in.love.to.save

On Tue, Sep 28, 2004 at 10:46:14AM +0100, Keith Wansbrough wrote:
I just saw this on the OCaml list (in a posting by "Rafael 'Dido' Sevilla"
in the "Observations on OCaml vs. Haskell" thread). I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test?
No wonder it is so slow; this program looks like the result of some ,,as slow as possible'' contest ;)

main = do file <- getContents
          putStrLn $ show (length $ lines file) ++ " " ++
                     show (length $ words file) ++ " " ++
                     show (length file)

Best regards,
Tom
--
.signature: Too many levels of symbolic links

On Tue, Sep 28, 2004 at 12:01:11PM +0200, Tomasz Zielonka wrote:
On Tue, Sep 28, 2004 at 10:46:14AM +0100, Keith Wansbrough wrote:
I just saw this on the OCaml list (in a posting by "Rafael 'Dido' Sevilla"
in the "Observations on OCaml vs. Haskell" thread). I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test? No wonder it is so slow, this program looks as a result of some ,,as slow as possible'' contest ;)
It took me half an hour to make a version which is 41 times faster on a 5MB file. It should be possible to make it even 2-3 times faster than this. Best regards, Tom -- .signature: Too many levels of symbolic links

On Tue, Sep 28, 2004 at 12:49:52PM +0200, Tomasz Zielonka wrote:
On Tue, Sep 28, 2004 at 12:01:11PM +0200, Tomasz Zielonka wrote:
On Tue, Sep 28, 2004 at 10:46:14AM +0100, Keith Wansbrough wrote:
I just saw this on the OCaml list (in a posting by "Rafael 'Dido' Sevilla"
in the "Observations on OCaml vs. Haskell" thread). I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test? No wonder it is so slow, this program looks as a result of some ,,as slow as possible'' contest ;)
It took me half an hour to make a version which is 41 times faster on a 5MB file. It should be possible to make it even 2-3 times faster than this.
Changed readArray to unsafeRead, and it is 47 times faster now.

I must say I am pleasantly surprised that GHC managed to unbox everything there was to unbox without many annotations. For a 5MB file the program allocated only 192KB in the heap. Especially the optimisation of higher-level constructs like 'fmap (toEnum . fromEnum) ...' is very nice.

Code attached. Feel free to improve it.

Best regards,
Tom
--
.signature: Too many levels of symbolic links
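The attachment itself isn't included in the archive; purely as an illustration of the hGetArray/unsafeRead style described above (the names and structure are my guesses, not Tomasz's code), the reading loop might look roughly like this:

import System.IO (Handle)
import Data.Array.IO (IOUArray, newArray, hGetArray)
import Data.Array.Base (unsafeRead)
import Data.Word (Word8)
import Data.Char (chr)

bufSize :: Int
bufSize = 4096

-- Hypothetical sketch: read chunks into an unboxed mutable array and fold a
-- strict counting function over the bytes, using unsafeRead to skip bounds
-- checks inside the inner loop.
countWith :: (a -> Char -> a) -> a -> Handle -> IO a
countWith f z0 h = do
    buf <- newArray (0, bufSize - 1) 0 :: IO (IOUArray Int Word8)
    let loop z = do
          n <- hGetArray h buf bufSize
          if n == 0
            then return z
            else inner 0 n z >>= loop
        inner i n z
          | i == n    = return z
          | otherwise = do
              w <- unsafeRead buf i
              let z' = f z (chr (fromIntegral w))
              z' `seq` inner (i + 1) n z'
    loop z0

It could be applied to System.IO.stdin together with one of the strict counting functions posted elsewhere in the thread.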

On Tue, 28 Sep 2004, Tomasz Zielonka wrote:
Changed readArray to unsafeRead, and it is 47 times faster now.
I must say I am pleasantly surprised that GHC managed to unbox everything there was to unbox without many annotations. For a 5MB file the program allocated only 192KB in the heap. Especially the optimisation of higher-level constructs like 'fmap (toEnum . fromEnum) ...' is very nice.
Now I'd like to see an implementation which is both elegant and fast ... :-)

Keith Wansbrough wrote:
I just saw this on the OCaml list (in a posting by "Rafael 'Dido' Sevilla"
in the "Observations on OCaml vs. Haskell" thread). I can't believe that a simple "wc" implementation should be 570 times slower in Haskell than OCaml - could someone investigate and fix the test?
I've been looking at the other shootout results (with the hope of learning something about making Haskell programs faster/less memory hungry) and I couldn't quite figure out why the "Hashes, part II" test consumes so much memory ( http://shootout.alioth.debian.org/bench/hash2/ ). So I started to try heap profiling, and generated the following graphs for the different types of profiles...

  biography => http://sleepingsquirrel.org/haskell/hash2_b.ps
  retainer  => http://sleepingsquirrel.org/haskell/hash2_r.ps
  closure   => http://sleepingsquirrel.org/haskell/hash2_d.ps
  type      => http://sleepingsquirrel.org/haskell/hash2_y.ps
  cost cntr => http://sleepingsquirrel.org/haskell/hash2_c.ps

...but I have a hard time figuring out how to prevent something like "stg_ap_3_upd_info" or "void" cells from consuming so much memory. Anyone have pointers on how to best use the profile information? I'm still trying to digest "Heap Profiling for Space Efficiency" ( http://portal.acm.org/citation.cfm?id=734156 ). Are there any other related papers out there? (Of course it might be the case that I need a FiniteMap tutorial.) Here's the code in question...

import System (getArgs)
import Data.FiniteMap

main = do [n] <- getArgs
          let get fm k = lookupWithDefaultFM fm 0 k
          let keys  = map (\x -> "foo_" ++ show x) [0..9999]
          let hash1 = listToFM $ zip keys [0..9999]
          let hash2 = listToFM $ zip keys (repeat 0)
          let update k fm = addToFM_C (+) fm k (get hash1 k)
          let res = foldr update hash2 (concat $ replicate (read n) keys)
          putStrLn $ unwords $ map show [get hash1 "foo_1", get hash1 "foo_9999",
                                         get res "foo_1", get res "foo_9999"]

Thanks,

Greg Buchholz

Greg Buchholz
I've been looking at the other shootout results (with the hope of learning something about making Haskell programs faster/less memory hungry) and I couldn't quite figure out why the "Hashes, part II" test consumes so much memory ( http://shootout.alioth.debian.org/bench/hash2/ ). So I started to try heap profiling, and generated the following graphs for the different types of profiles...
  biography => http://sleepingsquirrel.org/haskell/hash2_b.ps
  retainer  => http://sleepingsquirrel.org/haskell/hash2_r.ps
  closure   => http://sleepingsquirrel.org/haskell/hash2_d.ps
  type      => http://sleepingsquirrel.org/haskell/hash2_y.ps
  cost cntr => http://sleepingsquirrel.org/haskell/hash2_c.ps
...but I have a hard time figuring out how to prevent something like "stg_ap_3_upd_info" or "void" cells from consuming so much memory.
One thing you could do is to move the pure definitions (constants and functions) out of the monad. This will make them separate cost centres, with their own profile information. I toyed with this, but admittedly, it didn't change much in this case. I think it is better style, though.

A simple way to improve speed marginally is to specify Int instead of letting things default to Integer. A more complex way, saving about 60% of the time, is to use unboxed arrays instead of strings for the keys - memory consumption seems to be the same, though.

To get memory consumption down, I tried a strict "update" function:

update k fm = let x = (get hash1 k + get fm k)
              in x `seq` addToFM fm k x

which slowed the program down(!), but reduced memory consumption from about 25Mb to 1.5Mb. So it seems that the memory consumption is due to unevaluated values in the FMs.

BTW, I looked at the shootout web pages, but I couldn't find the specification for any of the benchmarks. What is and isn't allowed?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde
To get memory consumption down, I tried a strict "update" function:
update k fm = let x = (get hash1 k + get fm k) in x `seq` addToFM fm k x
which slowed the program down(!),
I wonder if this isn't due to never evaluating the values for "foo_2" to "foo_9998" because of laziness?
BTW, I looked at the shootout web pages, but I couldn't find the specification for any of the benchmarks. What is and isn't allowed?
For instance, changing the order of the updates shaves another 10-20% off the time (because of cache-friendliness, I suppose):

- let res = foldr update hash2 (concat $ replicate (read n) keys)
+ let res = foldr update hash2 (concat $ map (replicate (read n)) keys)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde wrote:
Ketil Malde
writes: To get memory consumption down, I tried a strict "update" function:
update k fm = let x = (get hash1 k + get fm k) in x `seq` addToFM fm k x
which slowed the program down(!),
Yes that fixes(?) it. The strict update removes the space leak and makes the FiniteMap perform as expected.
I wonder if this isn't due to never evaluating the values for "foo_2" to "foo_9998" because of laziness?
Maybe. On a whim I thought that maybe unevaluated addition thunks from addToFM_C were piling up, so I changed to addListToFM_C instead...

let update k fm = addListToFM_C (+) fm $ replicate (read n) (k, get hash1 k)
let res = foldr update hash2 keys

...but that hardly made a difference. So the search continues.

Greg Buchholz
participants (18)

- Andre Pang
- Andrew Cheadle
- Georg Martius
- Gour
- Graham Klyne
- Graham Klyne
- Greg Buchholz
- Henning Thielemann
- Josef Svenningsson
- Keith Wansbrough
- Ketil Malde
- Kevin Everets
- Malcolm Wallace
- Peter Simons
- Robert
- Ronny Wichers Schreur
- Sam Mason
- Tomasz Zielonka