Performance Tuning & darcs (a real shootout?)

Hello,

This will be a long email so I've tried to give it sections.

= Background =

I've been using Haskell for a while now (close to a year) and I'm still very much a newbie (but I've used other FP langs). I'm also very interested in darcs which, as many of you know, is written in Haskell. I've found a case which darcs does not handle efficiently and I'm trying to get the performance in that case to a satisfactory level. If you create a new repository, create a single large file (say 300mb) and record that file, darcs will take a very long time and use a lot of memory (usually at least 900mb for the 300mb case).

Before I begin telling you what I've tried so far, let me give a brief history of previous efforts by others to optimize darcs. As far as I know, the main person to work on optimizing darcs in the past is Ian Lynagh. He was able to convert darcs from stricter algorithms to lazy ones and this made a huge improvement in performance. In theory darcs should be able to lazily read from a file, compute the patch that needs to be written, write that patch and then lazily read the patch while it is applied to the repository. This mostly works as expected, but I've discovered that somewhere near the end of the application the memory needed spikes dramatically.

After almost two weeks of poking at darcs doing various benchmarks and profiles I've realized that optimizing Haskell programs is no easy task. I've been following the advice of numerous people from the Haskell IRC channel and learned a lot about darcs in the process. I've also been using this nifty library that Ian created for this purpose to get a measure of the non-mmap memory usage: http://urchin.earth.li/darcs/ian/memory

Potentially useful information about darcs: 1) it uses a slightly modified version of FastPackedStrings; 2) it can use mmap or not to read files (a compile-time option).

= Experiments and Findings =

I have a summary of some of my experimentation with darcs here: http://codersbase.com/index.php/Darcs_performance

Basically what I have found is that the read of the original file does not cause a spike in memory usage, nor does writing the patch. This would seem to imply that it's during application of the patch that the memory spikes. Modifying darcs to read the patch file and print just the first line of the patch causes some interesting results. The memory usage according to Ian's memory tool stays very low, at about 150kb max, but requesting the first line of the patch appears to make darcs read the entire patch! Darcs will literally grind away for, say, 30 minutes just to print the first line.

On a side note, I've tried turning off mmap and running some of the above experiments. Ian's tool reports the same memory usage, and top still reports large amounts of memory used. Does ghc use mmap to allocate memory instead of malloc? Even if it does, this shouldn't be a problem for Ian's tool as long as it maps the memory anonymously.

= Questions =

So far I've been tracking this performance problem by reading the output of ghc --show-iface and -ddump-simpl for strictness information, using the ghc profiler (although that makes the already bad performance much worse), Ian's memory tool, and a lot of experiments and guesswork with program modifications. Is there a better way? Are there tools or techniques that can help me understand why the memory consumption peaks when applying a patch? Is it foolish to think that lazy evaluation is the right approach?
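For comparison with the "print just the first line" experiment above, here is a minimal standalone sketch -- not darcs code, and assuming a plain uncompressed text file -- of what lazy first-line reading looks like on its own. With hGetContents only the prefix up to the first newline should ever be demanded, so memory use stays small however big the file is:

    import System.IO
    import System.Environment (getArgs)

    -- Hypothetical test program, not part of darcs: print only the
    -- first line of a (possibly huge) file.  hGetContents reads the
    -- file lazily, so only the data up to the first newline is
    -- actually pulled in.
    main :: IO ()
    main = do
      [path] <- getArgs
      h <- openFile path ReadMode
      s <- hGetContents h
      putStrLn (takeWhile (/= '\n') s)

If darcs' patch reading were this lazy all the way through, one would expect the first-line experiment to finish almost instantly; the fact that it doesn't suggests something is forcing the whole patch (the decompression and parsing layers are obvious places to look).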
I've been thinking that perhaps darcs should be modified to use bounded buffers so that we have tight control over the amount of memory consumed during a run. This solution would require a lot of reworking of the existing code, and sounds frightful from a maintenance point of view.

I'm also wondering if using mmap is a bad idea. Given the way files are currently mmap'd, I think we are limiting darcs to handling files which are small enough to fit in the address space at once (e.g., < 4GB on i386). Additionally, it would seem that using mmap does not really reduce memory stress.

I'm looking for advice or help in optimizing darcs in this case. I guess this could be viewed as a challenge for people that felt like the micro benchmarks of the shootout were unfair to Haskell. Can we demonstrate that Haskell provides good performance in the real world when working with large files? Ideally, darcs could easily work with a patch that is 10GB in size using only a few megs of ram if need be, and do so in about the time it takes to read the file once or twice and gzip it.

If anyone wants to look at the darcs code themselves, the unstable version is available via:

darcs get --partial http://abridgegame.org/repos/darcs-unstable

Just to recap, I'm looking for 1) advice, 2) tips, 3) design ideas, 4) tools, 5) libraries and just about anything else :)

Thanks,
Jason

Hello Jason,

Monday, January 23, 2006, 9:38:02 AM, you wrote:

JD> Potentially useful information about darcs;
JD> 1) Uses a slightly modified version of FastPackedStrings.

What do you mean? AFAIK there is no standard FastPackedString implementation, but there is a library that, with minimal modifications, is used in darcs, jhc and many other apps.

JD> still reports large amounts of memory used. Does ghc use mmap to
JD> allocate memory instead of malloc?

AFAIK the answer is yes. Look at ghc-6.4.1-src\ghc\rts\MBlock.c.

JD> I'm looking for advice or help in optimizing darcs in this case. I
JD> guess this could be viewed as a challenge for people that felt like
JD> the micro benchmarks of the shootout were unfair to Haskell. Can we
JD> demonstrate that Haskell provides good performance in the real-world
JD> when working with large files? Ideally, darcs could easily work with
JD> a patch that is 10GB in size using only a few megs of ram if need be
JD> and doing so in about the time it takes read the file once or twice
JD> and gzip it.

It seems that Ian just used this as a memory/time-efficient alternative to hGetContents. Reading from a memory-mapped file may be done as a pure computation if the whole file is mapped. Is this used in darcs? If not, then we can map a small block at a time, or even just use hGetBuf - at least for a first version. What would you say about this?

JD> Just to recap, I'm looking for 1) advice, 2) tips, 3) design ideas,
JD> 4) tools, 5) libraries and just about anything else :)

I'll be happy to help, but I'm not ready to study the darcs code. I'm ready to solve more concrete problems - such as providing a new input mechanism with a given external interface, or optimizing something rather small.

--
Best regards,
 Bulat                          mailto:bulatz@HotPOP.com

On Jan 23, 2006, at 3:33 AM, Bulat Ziganshin wrote:
What do you mean? AFAIK there is no standard FastPackedString implementation, but there is a library that, with minimal modifications, is used in darcs, jhc and many other apps.
I considered the version at Don Stewart's web site to be the standard; perhaps that was silly of me.
JD> still reports large amounts of memory used. Does ghc use mmap to
JD> allocate memory instead of malloc?
AFAIK the answer is yes. Look at ghc-6.4.1-src\ghc\rts\MBlock.c.
I've been doing the bulk of my testing on Linux on i386, and if I'm reading the ifdefs correctly, that memory is mmap'd anonymously. So at least I would expect Ian's memory tool to report the memory allocations correctly on that platform.
JD> I'm looking for advice or help in optimizing darcs in this case. I
JD> guess this could be viewed as a challenge for people that felt like
JD> the micro benchmarks of the shootout were unfair to Haskell. Can we
JD> demonstrate that Haskell provides good performance in the real-world
JD> when working with large files? Ideally, darcs could easily work with
JD> a patch that is 10GB in size using only a few megs of ram if need be
JD> and doing so in about the time it takes read the file once or twice
JD> and gzip it.
It seems that Ian just used this as a memory/time-efficient alternative to hGetContents. Reading from a memory-mapped file may be done as a pure computation if the whole file is mapped. Is this used in darcs?
I'm not sure; I have looked at the code but I can't tell. I think that was the point with the mmap'd files. There are several layers of abstraction at work here: Slurpies, PackedStrings, a (custom) lazy Reader monad, and maybe some others.
If not, then we can map a small block at a time, or even just use hGetBuf - at least for a first version. What would you say about this?
I'd expect it to be horribly slow, but if it gives guarantees about memory usage then I bet we could tune it up and make it tolerable.
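For reference, a minimal sketch of the hGetBuf approach being suggested: the whole file goes through one fixed-size buffer, so the memory ceiling is known up front. The block size and the 'process' callback here are made-up placeholders, not anything taken from darcs:

    import System.IO
    import Foreign.Marshal.Alloc (allocaBytes)
    import Foreign.Ptr (Ptr)
    import Data.Word (Word8)

    blockSize :: Int
    blockSize = 64 * 1024   -- arbitrary block size, for illustration only

    -- Feed a file through a single fixed buffer.  Memory use is bounded
    -- by blockSize however large the file is; 'process' stands in for
    -- whatever per-block work the caller actually wants to do.
    readInBlocks :: FilePath -> (Ptr Word8 -> Int -> IO ()) -> IO ()
    readInBlocks path process =
      allocaBytes blockSize $ \buf -> do
        h <- openBinaryFile path ReadMode
        let loop = do
              n <- hGetBuf h buf blockSize
              if n == 0
                then hClose h
                else process buf n >> loop
        loop

Whether this turns out slow in practice presumably depends mostly on the block size and on how much copying the per-block processing does, rather than on hGetBuf itself.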
I'll be happy to help, but I'm not ready to study the darcs code. I'm ready to solve more concrete problems - such as providing a new input mechanism with a given external interface, or optimizing something rather small.
That reminds me: you wrote FastIO, right? I haven't looked at FastIO; do you think it would apply here? Thanks, Jason

On Mon, Jan 23, 2006 at 08:37:55PM -0800, Jason Dagit wrote:
On Jan 23, 2006, at 3:33 AM, Bulat Ziganshin wrote:
What do you mean? AFAIK there is no standard FastPackedString implementation, but there is a library that, with minimal modifications, is used in darcs, jhc and many other apps.
I considered the version at Don Stewart's web site to be the standard; perhaps that was silly of me.
Actually, FastPackedString originated in darcs, and Don separated it out as a library, making minimal modifications... :) FastPackedString started out with the Data.PackedString code, which I modified to hold the data in a ForeignPtr and to treat the data as always being 8 bit words. Plus the mmap stuff and splitting. I renamed it when I deviated from the Data.PackedString interface (which is the module I used to use).
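For anyone who has not looked inside it, here is a stripped-down sketch of the representation David describes -- the bytes behind a ForeignPtr plus an offset and a length, treated as raw 8-bit words. The names and helpers are illustrative only, not the real FastPackedString API (which, among other things, exposes pure functions rather than the IO versions below):

    import Data.Word (Word8)
    import Data.Char (chr, ord)
    import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
    import Foreign.Storable (peekByteOff, pokeByteOff)

    -- The essential shape: a buffer of bytes behind a ForeignPtr, plus
    -- an offset and length so that substrings can share the buffer.
    data PackedString = PS !(ForeignPtr Word8) !Int !Int  -- buffer, offset, length

    -- Copy a String into a freshly allocated buffer, one byte per
    -- character (latin-1 style; anything above 255 is truncated).
    packString :: String -> IO PackedString
    packString s = do
      let len = length s
      fp <- mallocForeignPtrBytes len
      withForeignPtr fp $ \p ->
        sequence_ [ pokeByteOff p i (fromIntegral (ord c) :: Word8)
                  | (i, c) <- zip [0 ..] s ]
      return (PS fp 0 len)

    -- O(1) "drop": no copying, just a new offset and length.
    dropPS :: Int -> PackedString -> PackedString
    dropPS n (PS fp off len) = PS fp (off + m) (len - m)
      where m = max 0 (min n len)

    -- Read one byte back out as a Char.
    indexPS :: PackedString -> Int -> IO Char
    indexPS (PS fp off _) i = withForeignPtr fp $ \p -> do
      w <- peekByteOff p (off + i) :: IO Word8
      return (chr (fromIntegral w))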
It seems that Ian just used this as a memory/time-efficient alternative to hGetContents. Reading from a memory-mapped file may be done as a pure computation if the whole file is mapped. Is this used in darcs?
I'm not sure; I have looked at the code but I can't tell. I think that was the point with the mmap'd files. There are several layers of abstraction at work here: Slurpies, PackedStrings, a (custom) lazy Reader monad, and maybe some others.
Patch files are normally stored compressed, so we can't mmap them and treat them nicely. For crazy-size repositories one might prefer to store them uncompressed, at least in one's working repository (presumably not in the publicly available repo). If the patches aren't compressed, they are indeed mmapped as an entire file.

And alas, even I don't have much time at the moment to help with optimizing darcs.
--
David Roundy
http://www.darcs.net

Hi Jason,

Jason Dagit wrote:
After almost two weeks of poking at darcs doing various benchmarks and profiles I've realized that optimizing Haskell programs is no easy task. I've been following the advice of numerous people from the haskell irc channel and learned a lot about darcs in the process. I've also been using this nifty library that Ian created for this purpose to get a measure for the non-mmap memory usage: http://urchin.earth.li/darcs/ian/memory
Potentially useful information about darcs; 1) Uses a slightly modified version of FastPackedStrings. 2) Can use mmap or not to read files (compile time option).
=Experiments and Findings= I have a summary of some of my experimentation with darcs here: http://codersbase.com/index.php/Darcs_performance
You can get a quick picture of heap usage with +RTS -Sstderr, by the way. To find out what's actually in that heap, you'll need heap profiling (as you know).
Basically what I have found is that the read of the original file does not cause a spike in memory usage, nor does writing the patch. This would seem to imply that it's during application of the patch that the memory spikes. Modifying darcs to read the patch file and print just the first line of the patch causes some interesting results. The memory usage according to Ian's memory tool stays very low, at about 150kb max, but requesting the first line of the patch appears to make darcs read the entire patch! Darcs will literally grind away for, say, 30 minutes to just print the first line.
On a side note, I've tried turning off mmap and running some of the above experiments. Ian's tool reports the same memory usage, and top still reports large amounts of memory used. Does ghc use mmap to allocate memory instead of malloc? Even if it does this shouldn't be a problem for Ian's tool as long as it maps it anonymously.
Yes, GHC's heap is mmap()'d anonymously. You really need to find out whether the space leak is mmap()'d by GHC's runtime, or by darcs itself - +RTS -Sstderr or profiling will tell you about GHC's memory usage.
=Questions= So far I've been tracking this performance problem by reading the output of ghc --show-iface and -ddump-simpl for strictness information, using the ghc profiler (although that makes the already bad performance much worse), Ian's memory tool, and a lot of experiments and guesswork with program modifications. Is there a better way?
I'd start by using heap profiling to track down what the space leak consists of, and hopefully to give you enough information to diagnose it. Let's see some heap profiles! Presumably the space leak is just as visible with smaller patches, so you don't need the full 300M patch to investigate it. I don't usually resort to -ddump-simpl until I'm optimising the inner loop; use profiling to find out where the inner loops actually *are* first.
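For concreteness, the kind of invocation Simon is describing might look roughly like this (the exact build command for darcs will differ; the RTS flags are standard GHC ones, and -hc/-p require the binary to be built with -prof):

    ghc --make -prof -auto-all ...          (build with profiling enabled)
    ./darcs record ... +RTS -Sstderr        (per-GC allocation and residency summary)
    ./darcs record ... +RTS -hc -p          (heap profile by cost centre, plus a time profile)
    hp2ps -c darcs.hp                       (render the heap profile darcs.hp as darcs.ps)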
Are there tools or techniques that can help me understand why the memory consumption peaks when applying a patch? Is it foolish to think that lazy evaluation is the right approach?
Since you asked, I've never been that keen on mixing laziness and I/O. Your experiences have strengthened that conviction - if you want strict control over resource usage, laziness is always going to be problematic. Sure, it's great if you can get it right: the code is shorter and runs in small constant space. But can you guarantee that it'll still have the same memory behaviour with the next version of the compiler? With a different compiler? If you want guarantees about resource usage, which you clearly do, then IMHO you should just program the I/O explicitly and avoid laziness. It'll be a pain in the short term, but a win in the long term.
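As a small illustration of what "programming the I/O explicitly" can look like, here is a sketch of an explicit line-at-a-time loop in place of consuming a lazy hGetContents result. The helper and its usage are hypothetical, not taken from darcs:

    import System.IO
    import Control.Monad (unless)

    -- Run an action on each line of a file using an explicit IO loop.
    -- Memory use is bounded by the longest line, and the space
    -- behaviour cannot be changed by how lazily a caller consumes
    -- anything, because nothing lazy escapes this function.
    withLines :: FilePath -> (String -> IO ()) -> IO ()
    withLines path act = do
      h <- openFile path ReadMode
      let loop = do
            eof <- hIsEOF h
            unless eof (hGetLine h >>= act >> loop)
      loop
      hClose h

For example, withLines "big.patch" putStrLn would echo a file in roughly constant space (the file name is, of course, made up).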
I'm looking for advice or help in optimizing darcs in this case. I guess this could be viewed as a challenge for people that felt like the micro benchmarks of the shootout were unfair to Haskell. Can we demonstrate that Haskell provides good performance in the real-world when working with large files? Ideally, darcs could easily work with a patch that is 10GB in size using only a few megs of ram if need be and doing so in about the time it takes to read the file once or twice and gzip it.
I'd love to help you look into it, but I don't really have the time. I'm happy to help out with advice where possible, though. Cheers, Simon

On Jan 24, 2006, at 1:55 AM, Simon Marlow wrote:
You can get a quick picture of heap usage with +RTS -Sstderr, by the way. To find out what's actually in that heap, you'll need heap profiling (as you know).
[snip]
Yes, GHC's heap is mmap()'d anonymously. You really need to find out whether the space leak is mmap()'d by GHC's runtime, or by darcs itself - +RTS -Sstderr or profiling will tell you about GHC's memory usage.
Ah, I had been using the lowercase -s option, but I forgot about the existence of the uppercase -S. I'll try to include some profiles and the knowledge gained by using it. I wish I could work on that right now, but chances are it will be Monday or Tuesday before I get to look at it again.
I'd start by using heap profiling to track down what the space leak consists of, and hopefully to give you enough information to diagnose it. Let's see some heap profiles!
Yes!
Presumably the space leak is just as visible with smaller patches, so you don't need the full 300M patch to investigate it.
This is true; I've had problems with even 30mb patches. I guess I liked using the 300mb patch because it emphasized and exaggerated the performance problem, and often I left the profile running on one machine while I went off studying the code on another. But it's a good suggestion for when I want to be able to iterate or get my results sooner.
I don't usually resort to -ddump-simpl until I'm optimising the inner loop, use profiling to find out where the inner loops actually *are* first.
Point taken.
Are there tools or techniques that can help me understand why the memory consumption peaks when applying a patch? Is it foolish to think that lazy evaluation is the right approach?
Since you asked, I've never been that keen on mixing laziness and I/O. Your experiences have strengthened that conviction - if you want strict control over resource usage, laziness is always going to be problematic. Sure it's great if you can get it right, the code is shorter and runs in small constant space. But can you guarantee that it'll still have the same memory behaviour with the next version of the compiler? With a different compiler?
And I've heard others say that laziness adds enough unpredictability that it makes optimizing just that much trickier. I guess this may be one of the cases where the "trickiness" outweighs the elegance.
I'm looking for advice or help in optimizing darcs in this case. I guess this could be viewed as a challenge for people that felt like the micro benchmarks of the shootout were unfair to Haskell. Can we demonstrate that Haskell provides good performance in the real-world when working with large files? Ideally, darcs could easily work with a patch that is 10GB in size using only a few megs of ram if need be and doing so in about the time it takes to read the file once or twice and gzip it.
I'd love to help you look into it, but I don't really have the time. I'm happy to help out with advice where possible, though.
Several people have spoken up and said, "I'd help but I'm busy", including droundy himself. This is fine; when I said "help" I was thinking of advice like you gave. It was a poor choice of phrasing on my part. I can work with ghc and stare at lines of code, but sometimes I need guidance since I'm mostly out of my league in this case. Thanks, Jason