Real-time garbage collection for Haskell

I have seen some proposals around here for SoC projects and other things to try to improve the latency of GHC's garbage collector. I'm currently developing a game in Haskell, and even 100ms pauses are unacceptable for a real-time game. I'm calling out to people who have seen or made such proposals, because I would be willing to contribute funding and/or mentor a project that would contribute to this goal. Also, any ideas for reducing this latency in other ways would be much appreciated. Thanks, Luke

Did you really see 100ms pauses?! I never did extensive research on this, but my numbers are rather in the microsecond range (below 1ms). What causes such a long garbage collection? Lots of allocated and long-lived objects? Pavel.

My experience agrees with Pavel's. I've never observed pauses that size. I have an application that runs in 'rate equivalent real-time' (i.e. there may be some jitter in the exact time of events, but it does not accumulate). It does have some visibility of the likely time of future events and uses that to perform some speculative garbage collection. GC is pretty short and I've not seen an effect > 1ms in those runs (all the usual caveats apply - my programs are not your programs, etc).
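The speculative part is nothing deeper than this sort of shape (a sketch - `nextEventIn` stands in for our estimate of the time to the next event, and the threshold is illustrative):

    import System.Mem (performGC)

    -- If the next event is comfortably far away, spend the idle time
    -- collecting now, rather than risk a collection landing mid-event.
    speculativeGC :: IO Double -> IO ()
    speculativeGC nextEventIn = do
      gap <- nextEventIn          -- estimated seconds to the next event
      if gap > 0.010              -- illustrative threshold
        then performGC
        else return ()

Neil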

2010/02/28 Neil Davies
I've never observed ones that size. I have an application that runs in 'rate equivalent real-time' (i.e. there may be some jitter in the exact time of events but it does not accumulate). It does have some visibility of likely time of future events and uses that to perform some speculative garbage collection.
Do you have information on how it behaves without speculative GC? -- Jason Dusek

Sorry, no. We wanted a basic bound on the jitter - the application is not one that creates much (if any) long-lived heap. Having just seen Simon's email on the fact that performGC forces a major GC, I think that there is some new mileage here in making the speculative GCs minor ones. More control needs some more instrumentation of how much mutation is occurring, and ways of estimating how much of that is short- and long-lived - I know that past history is not necessarily a good indicator of future actions, but visibility of the counters that are being kept would help. Neil

On Sun, Feb 28, 2010 at 2:06 AM, Pavel Perikov wrote:
Did you really see 100ms pauses?! I never did extensive research on this, but my numbers are rather in the microsecond range (below 1ms). What causes such a long garbage collection? Lots of allocated and long-lived objects?
This is all hypothetical right now. I heard some horror stories in which people had to move the main game loop to C++ and only do the AI logic in Haskell because of pauses. I would rather not do that, especially because this project is *about* proving Haskell as a viable game development platform. So I am trying to be prepared if I do see something like that, so that it doesn't put the show on hold for a few months. Presumably large, long-lived objects would cause the generation 0 collections to take a long time. I am not sure if we will have any such objects, but we can't rule it out... Thanks for the positive reassurances, at least. I'd like to hear from people with the opposite experience, if there are any. Luke

Sounds like we need to come up with some benchmarking programs so we
can measure the GC latency and soft-realtimeness...
PS: Regarding Haskell and games: the University of Utrecht teaches
Haskell in their brand new "game technology" course :-)
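Something along these lines is the shape I have in mind - keep a chunk of live heap, churn allocation, and record the worst gap between timestamps as a crude stand-in for pause time (sizes and counts are arbitrary; +RTS -s gives the authoritative numbers):

    import Data.Time.Clock (NominalDiffTime, diffUTCTime, getCurrentTime)
    import qualified Data.Map as M

    main :: IO ()
    main = do
        -- Some live data, so that major collections have work to do.
        let live = M.fromList [ (i, replicate 10 i) | i <- [1 :: Int .. 200000] ]
        M.size live `seq` go live 0 0
      where
        go :: M.Map Int [Int] -> Int -> NominalDiffTime -> IO ()
        go live n worst
          | n >= 10000 = putStrLn ("worst gap: " ++ show worst)
          | otherwise  = do
              t0 <- getCurrentTime
              -- Churn: allocate short-lived data that touches the live map.
              let junk = sum (take 100 (M.keys live))
              junk `seq` return ()
              t1 <- getCurrentTime
              go live (n + 1) (max worst (diffUTCTime t1 t0))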

On Monday 01 March 2010 01:04:37 am Luke Palmer wrote:
Thanks for the positive reassurances, at least. I'd like to hear from people with the opposite experience, if there are any.
Yes there are. I am working on a game with Haskell using OpenGL rendering. I've done some frame time measurements lately and encountered single frames needing more than 100ms to be rendered. I am currently trying to gather information on what is going on in these 100ms and why. From what I understand, the GC is running very often and just some (very few) of its runs are very slow. BTW: switching to parallel GC (either with -N1 or -N2 (on a dual core machine)) made the behavior MUCH worse, for some reason. Sönke

On 01/03/2010 14:16, Sönke Hahn wrote:
BTW: switching to parallel GC (either with -N1 or -N2 (on a dual core machine)) made the behavior MUCH worse, for some reason.
Probably due to cache effects - shipping the data off to a different core in the GC can far outweigh any benefits you would have got by traversing the heap in parallel. For a single-threaded program, parallel GC tends to only help when the heap gets large (over 50MB, at a guess). For parallel programs, parallel GC helps by keeping the data in the right cache.

I'm finding that for parallel programs turning off load-balancing with +RTS -qb often helps, because load-balancing is bad for locality. Again, if the heap gets large, then collecting it in parallel is probably more important than keeping it in the cache. Cheers, Simon

On Monday 01 March 2010 03:16:25 pm Sönke Hahn wrote:
FYI: these high frame times were caused by a space leak. With the help of the excellent hp2any-manager I found that out, and where to put the needed strictness annotations.
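For anyone hitting the same thing, the fix had the usual shape - a bang on accumulated state (illustrative code, not my actual game):

    {-# LANGUAGE BangPatterns #-}

    -- Leaky: the accumulator is carried as a growing chain of thunks
    -- and only forced, all at once, much later.
    leakySum :: [Double] -> Double
    leakySum = go 0
      where
        go acc []       = acc
        go acc (dt:dts) = go (acc + dt) dts

    -- Fixed: force the accumulator on every step.
    strictSum :: [Double] -> Double
    strictSum = go 0
      where
        go !acc []       = acc
        go !acc (dt:dts) = go (acc + dt) dts

Sönke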

By default GHC uses a generational GC with a 512KB nursery size, which means that most GC pauses will be very short, but just occasionally you have to do a major GC and there's no upper bound on the pause time for that collection, because the whole live heap is traversed. The pause time for a major collection is proportional to the amount of live data, so the people who are experiencing very short pauses probably have little live data and/or have allocation patterns that mean the old generation is collected very infrequently.

Monolithic major GC is a big stinking scalability problem, and the only solutions are to do concurrent GC, incremental GC, or region-based GC. Both concurrent GC and incremental GC tend to add overheads to the mutator, because they need a read barrier. There was an incremental GC for GHC once [1], taking advantage of the built-in read barrier that we have whereby most closures are "entered"; it more-or-less worked, but was quite complicated and had some code-size overhead. Nowadays part of this built-in read barrier has been eroded by pointer tagging, which makes things a bit more tricky.

Region-based GC is a generalisation of generational GC, where you divide the heap into regions and track all pointers between regions, so a given collection can collect any subset of the regions. This basic idea is quite popular amongst the Java people these days; Thomas mentioned the G1 collector, which does this and uses a concurrent mark phase to identify which regions to collect. Regardless of how you figure out which regions to collect, the basic idea of breaking up the old generation like this is a good way to reduce pause times.

At the moment I'm focussed on something different: making individual processors able to collect their own heaps independently, so removing the stop-the-world requirement from minor GCs. This should help parallel throughput a lot, but won't fix the major GC pause times. I am slightly worried that the GC can't bear the extra complexity of adding both processor-independent GC *and* region-based GC or some other pause-time-reducing technique. But we'll see. I'll happily supervise/mentor anyone who wants to tackle this. Cheers, Simon

[1] http://delivery.acm.org/10.1145/360000/351265/p257-cheadle.pdf?key1=351265&key2=8540457621&coll=GUIDE&dl=GUIDE&CFID=80115628&CFTOKEN=59704548

Both concurrent GC and incremental GC tend to add overheads to the mutator, because they need a read barrier. There was an incremental GC for GHC once [1], taking advantage of the built-in read barrier that we have whereby most closures are "entered"
Was there a specific reason why that GC implementation chose to use a read barrier rather than a write barrier? I would have thought that, in general, a write barrier is cheaper to implement. Doesn't GHC update fewer thunks than it enters? Regards, Malcolm

On 02/03/2010 14:11, Malcolm Wallace wrote:
Was there a specific reason why that GC implementation chose to use a read barrier rather than a write barrier? I would have thought that, in general, a write barrier is cheaper to implement. Doesn't GHC update fewer thunks than it enters?
If the GC wants to move objects, then a read barrier is needed in order to figure out whether you're looking at an object that has moved or not. If you don't move objects, then you can get away with only a write barrier - the write barrier is needed so that you can remember which objects or fields might need to be re-scanned because they were mutated.

The choice made in the Non-stop Haskell paper was to take advantage of the existing read barrier so that we could move objects. Some schemes copy objects rather than move them, and then you can get away without a read barrier, but then your write barrier has to keep the two copies in sync. Tradeoffs are everywhere, GC is a tricky business! Cheers, Simon

On Sun, Feb 28, 2010 at 6:06 PM, Pavel Perikov wrote:
Did you really see 100ms pauses?! I never did extensive research on this, but my numbers are rather in the microsecond range (below 1ms). What causes such a long garbage collection? Lots of allocated and long-lived objects?
I am using an automated options trading system written in Haskell. I'm more on the business side than the technical side of the issues, so I'm not clear on all the details. I can confirm that without tweaking the RTS settings we were seeing over 100ms GC pauses. I've mainly been trying to minimise our overall response time, and we were able to improve this by increasing the allocation area with -A. I think this brought GC well under 100ms. We are still working on analysis of this. I can also confirm, as others seem to have found, that under 6.12 the parallel GC seemed to make things much worse. I am always turning it off with -qg. If there is a project to improve performance of the GC I could be interested to contribute.
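For reference, this amounts to an invocation of roughly this shape (the binary name and the -A value are illustrative, not our production settings):

    ./trader +RTS -A64m -qg -RTS

-A sets the size of the allocation area (nursery); -qg turns the parallel GC off. Simon Cranshaw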

Simon,
Would a more predictable GC or a faster GC be better in your case? (Of
course, both would be nice.)
/jve

John,
I suspect speed is more important than timing. In practice I don't mind a pause for a GC when nothing is happening in the market. It's when the market is moving fast that we particularly want to keep our response time low. Busy times may continue for long periods though, and I'm not sure if being able to defer GC (if that's what you're suggesting) would be possible for that long. We are still studying how the GC times are interacting with our response times, and so hopefully I will have a better answer to this later on.
Simon

Luke Palmer wrote:
I have seen some proposals around here for SoC projects and other things to try to improve the latency of GHC's garbage collector.
I'm guessing making the GC concurrent (i.e., so you can perform GC without having to stop all Haskell threads) would probably help in the multithreaded case... (I'm also guessing this is slightly nontrivial to implement. :-) )

Overly long garbage collection might also be a sign of space leaks. But there are many other things that can go wrong in a real-time system and might explain your delays. For example, you might need to avoid amortized-time data structures like Data.Sequence. Or, for physics simulations, you'd need to fix the time step ∆t, as described in http://gafferongames.com/game-physics/fix-your-timestep/ or numerical integration will deteriorate rather quickly.
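The core of the fixed time step pattern, transcribed to Haskell (a sketch; `integrate` and the State type are placeholders):

    import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)

    type State = Double              -- placeholder for real simulation state

    dt :: Double
    dt = 1 / 60                      -- the fixed physics step

    integrate :: Double -> State -> State
    integrate h s = s + h            -- placeholder physics

    -- Accumulate wall-clock time and consume it in fixed dt-sized steps,
    -- so the simulation stays deterministic however frame times jitter.
    -- Kick off with: getCurrentTime >>= \t0 -> gameLoop t0 0 state0
    gameLoop :: UTCTime -> Double -> State -> IO ()
    gameLoop lastT acc st = do
      now <- getCurrentTime
      let frame = realToFrac (diffUTCTime now lastT)
          step a s | a >= dt   = step (a - dt) (integrate dt s)
                   | otherwise = (a, s)
          (acc', st') = step (acc + frame) st
      print st'                      -- stand-in for rendering
      gameLoop now acc' st'

Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com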

On Sun, Feb 28, 2010 at 10:03 AM, Heinrich Apfelmus wrote:
But there are many other things that can go wrong in a real time system and might explain your delays. For example, you might need to avoid amortized time data structures like Data.Sequence . Or for physics simulations, you'd need to fix the time step ∆t, as described in
http://gafferongames.com/game-physics/fix-your-timestep/
or numerical integration will deteriorate rather quickly.
Incidentally, what's described there is a simplified version of the frequency locked loops described in Massalin's thesis on Synthesis OS, and it is used there for about the same purpose in a soft real-time context. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4871

There is a SoC project suggestion to implement Immix's ideas [1] in GHC's GC. Both already use similar overall designs: both split the heap into regions which may employ different collection strategies. However, Immix does not address real-time issues.

The main difficulty with real-time GC is that, while first-generation collection is usually very fast, eventually you just have to collect the old generation, and you have to do it all at once. Sun's new Garbage-First ("G1") [2] collector therefore tracks pointers between regions, as opposed to just pointers from older to newer generations. This allows collecting regions independently (and in parallel). G1 is still stop-the-world, although the marking phase is concurrent. Tracking pointers between all regions can result in quite substantial space overheads, however, so G1 uses some heuristics to discover "popular objects" and treats them specially. In a personal conversation Simon Marlow expressed to me that he intends to go further in this direction, but I don't know how high-priority it is. In general I don't think true real-time is the goal in any case, but rather a general effort to keep GC pauses short.

Truly concurrent garbage collection is a whole different beast. Concurrent marking can be implemented efficiently with a write barrier. I don't know of any fully concurrent GC scheme that gets by without a read barrier and significant space overhead, however. There are certainly no plans from the GC HQ to implement a fully concurrent GC.

[1]: http://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf
[2]: http://research.sun.com/jtech/pubs/04-g1-paper-ismm.pdf

-- Push the envelope. Watch it bend.

My current area of work is on realtime embedded software programming for
avionics systems. We do most of our coding in Ada but I've been dreaming of
using Haskell instead.
However, the garbage collector is actually one of the larger obstacles to
making this happen.
All of our avionics software needs to be certified by various regulatory
agencies, and there are varying levels of certification depending on
criticality. For the higher certification levels we would need to be able to be
sure (or at least very, very confident) that the GC will collect everything
within a fixed amount of time, and that it won't take more than some fixed
amount of time per major frame to do it.
A delay of a several milliseconds that could occur effectively at random is
completely unacceptable.
I would be very interested in alternative GC algorithms/approaches that
would have a more deterministic/realtime behavior. I would even be
willing to help out if there is other interest in this area :)
As a side note, I ran across an article on a way to use 100% reference
counting in a pure language by using weak references and being careful how
you preserve the weak/strong references during graph reduction:
http://comjnl.oxfordjournals.org/cgi/content/abstract/33/5/466
I don't want to pay $25 for the article though so I don't know how viable it
is. It would probably have lower performance than the current generational
GC but in this case I'd be willing to trade performance for determinism.
Has anyone heard of this algorithm before?
- Job

On 1 March 2010 16:27, Job Vranish wrote:
My current area of work is on realtime embedded software programming for avionics systems. We do most of our coding in Ada but I've been dreaming of using Haskell instead.
Do you really think this is realistic? Garbage collector aside, Haskell's execution model is very difficult to predict, which I would suspect is crucial for even soft real-time systems. The whole concept of lazy evaluation seems to run counter to the idea of real-time systems. Lazy evaluation essentially says "do as little as possible *now*" at the expense of having to do it all later. For a real-time system you want almost the opposite; you want to make sure that you complete all the required work in the current time slice.

A possible workaround would be to sprinkle lots of 'rnf's around your code to make sure you don't build up a thunk or two that will delay you later. And if you do this, aren't you essentially programming in a strict functional language (like SML or O'Caml)? By careful profiling and auditing you can probably rule out most of the potential bad cases, so it can be acceptable for a soft real-time system (Galois did something like this, I believe). But for avionics systems you probably want more assurances than that, don't you?
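To make the sprinkling concrete, this is the kind of thing I mean (a sketch; the Frame type and its contents are made up):

    import Control.DeepSeq (NFData (..), deepseq)

    -- Strict fields alone would not reach inside the list, hence the
    -- hand-written NFData instance.
    data Frame = Frame { time :: !Double, events :: [String] }

    instance NFData Frame where
      rnf (Frame t es) = t `seq` rnf es

    advance :: Frame -> Frame
    advance (Frame t es) = Frame (t + 1/60) (take 100 es)

    -- Force the new state completely before the next frame, so no
    -- thunk chain can come due at an awkward moment later.
    frameLoop :: Frame -> IO ()
    frameLoop f = let f' = advance f
                  in f' `deepseq` frameLoop f'

-- Push the envelope. Watch it bend.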

The whole concept of lazy evaluation seems to run counter to the idea of real-time systems.
Hi Thomas,
Lazy evaluation is okay since it has deterministic characteristics. We can
predict what will happen quite accurately (heck, we can model it in simpler
cases). It might take a while to get people comfortable with the concept,
but it wouldn't be a show stopper (actually, some people would benefit
greatly from lazy evaluation and referential transparency).
The GC throws a wrench in a system which would otherwise make it past
certification with enough effort. If we can write a GC that can be modeled,
we'd have an excellent case for using Haskell in aerospace.
/jve

On Mon, Mar 1, 2010 at 2:37 PM, Thomas Schilling wrote:
A possible workaround would be to sprinkle lots of 'rnf's around your code to make sure you don't build up a thunk or two that will delay you later. And if you do this, aren't you essentially programming in a strict functional language (like SML or O'Caml)? By careful profiling and auditing you can probably rule out most of the potential bad cases, so it can be acceptable for a soft real-time system (Galois did something like this, I believe). But for avionics systems you probably want more assurances than that, don't you?
Yes and no. It's true that lazy evaluation makes reasoning about timings a bit more difficult (and might not be usable in very time-critical scenarios), but it still has well-defined deterministic behavior.

It's the referential transparency that saves us here. If you run a lazy function with the same objects (in the same evaluation state) it should _theoretically_ take the same amount of time to run. All of our toplevel inputs will be strict, and if we keep our frame-to-frame state strict, our variances in runtimes, given the same inputs, should be quite low modulo the GC.

Even our current code can take significantly different amounts of time to compute things depending on what you're doing. Some waypoints take longer to look up from the database than others. Predicting the time to arrival can take significantly longer/shorter depending on seemingly trivial parameters, etc...

It matters less that code always takes the same amount of time to run (though it needs to always be less than the frame time) and more that it always takes the same amount of time to run given the same initial conditions. - Job

I don't know that hanging your hat on the deterministic coat hook is the right thing to do. The way that I've always looked at this is more probabilistic - you want the result to arrive within a certain time frame for a certain operation with a high probability. There is always the probability that the h/w will fail anyway - you could even reason that the software taking too long is just a transient fault that clears - random (non-correlated - preferably a Bernoulli choice) failures are OK, non-deterministic ones aren't. This probabilistic approach - accepting a low probability of being in the tail of the timing distribution - would give a lot more flexibility in any form of (say incremental) GC: you may not be able to bound the incremental steps absolutely, but a strong probabilistic bound might well be more achievable. Neil

On 1 March 2010 21:34, Neil Davies wrote:
I don't know that hanging your hat on the deterministic coat hook is the right thing to do. The way that I've always looked at this is more probabilistic - you want the result to arrive within a certain time frame for a certain operation with a high probability.
That's where I would think the difference between hard and soft real-time lies, as I understand it. Of course, "real" hard real-time is practically impossible on modern hardware anyway, due to caches, branch prediction, out-of-order execution and such like.
There is always the probability that the h/w will fail anyway - you could even reason that the software taking too long is just a transient fault that clears - random (non-correlated - preferably a Bernoulli choice) failures are OK, non-deterministic ones aren't. This probabilistic approach - accepting a low probability of being in the tail of the timing distribution - would give a lot more flexibility in any form of (say incremental) GC: you may not be able to bound the incremental steps absolutely, but a strong probabilistic bound might well be more achievable.
The Garbage-First paper showed some promising but not spectacular successes in keeping the soft real-time goal. I don't know the correct numbers off the top of my head, but I think they missed the deadline in > 5% of the cases. I would assume that it should be < 1% if you want to treat it as a random failure. I understand that in a fault-tolerant system you have to handle worse things than that, but you make assumptions about the likelihood of each kind of error (otherwise you may run into QoS issues).

As Job and John have pointed out, though, laziness per se doesn't seem to be an issue, which is good to hear. Space leaks might, but there is no clear evidence that they are particularly harder to avoid than in strict languages. As Röjemo and Runciman put it: "By using heap profiles on a lazy language we find problems with lazy languages. Using it on a strict language we would find problems with strict languages too." [1] (very recommended paper, btw.)

[1]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219

-- Push the envelope. Watch it bend.

On 2010-03-01 19:37 +0000 (Mon), Thomas Schilling wrote:
A possible workaround would be to sprinkle lots of 'rnf's around your code....
As I learned rather to my chagrin on a large project, you generally don't want to do that. I spent a couple of days writing instances of NFData and loading up my application with rnfs, and then watched performance fall into a sinkhole. I believe that the problem is that rnf traverses the entirety of a large data structure even if it's already strict and doesn't need traversal. My guess is that doing this frequently on data structures (such as Maps) of less than tiny size was blowing out my cache.

I switched strategies to forcing a deep(ish) evaluation of only newly constructed data instead. For example, after inserting a newly constructed object into a Map, I would look it up and force evaluation only of the result of that lookup. That solved my space leak problem and made things chug along quite nicely.

Understanding the general techniques for this sort of thing and seeing where you're likely to need to apply them isn't all that difficult, once you understand the problem. (It's probably much easier if you don't have to work it out all for yourself, as I did. Someone needs to write the "how to manage laziness in Haskell" guide.) The difficult part of it is that you've really got to stay on top of it, because if you don't, the space leaks come back and you have to go find them again. It feels a little like dealing with buffers and their lengths in C.
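In code, the insert-then-force pattern I described above looks roughly like this (a sketch with made-up types, not my actual application):

    import qualified Data.Map as M

    data Order = Order Int Double    -- made-up payload

    -- Insert, then look the value back up and force it, so the entry
    -- we just created holds evaluated data rather than a thunk.
    insertForced :: Int -> Order -> M.Map Int Order -> M.Map Int Order
    insertForced k o m =
      let m' = M.insert k o m
      in case M.lookup k m' of
           Just (Order q p) -> q `seq` p `seq` m'
           Nothing          -> m'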
On 2010-03-01 16:06 -0500 (Mon), Job Vranish wrote:
All of our toplevel inputs will be strict, and if we keep our frame-to-frame state strict, our variances in runtimes, given the same inputs, should be quite low modulo the GC.
This is exactly the approach I need to take for the trading system. I basically have various (concurrent) loops that process input, update state, and possibly generate output. The system runs for about six hours, processing five million or so input messages, with other loops running anywhere from hundreds of thousands to millions of times. The trick is to make sure that I never, ever start a new loop with an unevaluated thunk referring to data needed only by the previous loop, because otherwise I just grow and grow and grow.... Some tool to help with this would be wonderful. There's something for y'all to think about.

On 2010-03-01 22:01 +0000 (Mon), Thomas Schilling wrote:
As Job and John have pointed out, though, laziness per se doesn't seem to be an issue, which is good to hear. Space leaks might, but there is no clear evidence that they are particularly harder to avoid than in strict languages.
As I mentioned above, overall I find them so. Any individual space
leak you're looking at is easy to fix, but the constant vigilance is
difficult.
cjs
--
Curt Sampson

cjs:
I believe that the problem is that rnf traverses the entirety of a large data structure even if it's already strict and doesn't need traversal. My guess is that doing this frequently on data structures (such as Maps) of less than tiny size was blowing out my cache.
And rnf will do the traversal whether it is needed or not. Imo, it is better to ensure the structures you want are necessarily strict by definition, so that only the minimum additional evaluation is necessary. 'rnf' really is a hammer, but not everything is a nail.
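For example (a sketch), making strictness part of the type rather than an afterthought:

    -- Strict fields: building a P forces both coordinates, so a fully
    -- constructed structure of Ps cannot be hiding unevaluated thunks,
    -- and no rnf traversal is needed to keep it that way.
    data P = P !Double !Double

    -- The lazy version, by contrast, may carry a thunk in each field:
    data LazyP = LazyP Double Double

(Compiling with -funbox-strict-fields makes the strict version cheaper still.)

-- Don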

On Thu, Mar 4, 2010 at 11:10 AM, Don Stewart wrote:
And rnf will do the traversal whether it is needed or not. Imo, it is better to ensure the structures you want are necessarily strict by definition, so that only the minimum additional evaluation is necessary.
Isn't the downside of strict structures the implicit nature of the 'strictification'? You lose the fine-grained control of when particular values should be strict. Jason

Curt Sampson wrote:
Understanding the general techniques for this sort of thing and seeing where you're likely to need to apply them isn't all that difficult, once you understand the problem. (It's probably much easier if you don't have to work it out all for yourself, as I did. Someone needs to write the "how to manage laziness in Haskell" guide.)
My attempt in this direction: http://www.haskell.org/haskellwiki/Laziness_is_not_always_good http://www.haskell.org/haskellwiki/Maintaining_laziness

On 2010-03-05 10:50 +0100 (Fri), Henning Thielemann wrote:
My attempt in this direction: http://www.haskell.org/haskellwiki/Laziness_is_not_always_good http://www.haskell.org/haskellwiki/Maintaining_laziness
Unfortunately, neither of these addresses the problem of the day-to-day
programmer: "what are the typical ways I introduce space leaks, and how
do I stop doing that?"
cjs
--
Curt Sampson

On 01/03/2010 14:53, Thomas Schilling wrote:
Truly concurrent garbage collection is a whole different beast. Concurrent marking can be implemented efficiently with a write barrier. I don't know of any fully concurrent GC scheme that gets by without a read barrier and significant space overhead, however. There are certainly no plans from the GC HQ to implement a fully concurrent GC.
A good summary of concurrent GC techniques used successfully in industry was posted to the GC mailing list recently by Erez Petrank: http://permalink.gmane.org/gmane.comp.programming.garbage-collection.general...

Several of those we could do in GHC. For example, mostly-concurrent collection uses only a write barrier, and piggybacks on the generational write barrier we already have, but it does require two stop-the-world phases per GC. I think you'd want to do it in conjunction with Immix-style region sweeping in the old generation, so implementing Immix would be a good first step. Cheers, Simon

Since we're talking games here (my profession), I'd point out that it would be cool to be able to supply "hints" to the GC about when might be a good time to do a GC (without unconditionally forcing it); games in particular have some pretty specific properties that may be exploitable.

Presumably a non-trivial portion of the objects copied from the nursery/G0 are actually short-lived objects that just happened to have their short life-span overlap with the collection. So really, copying them to the next generation is a "mistake" in some sense. For games, though, we have a very good point that occurs regularly where we know that all/most short-lived objects will no longer be referenced - at the start of a fresh frame. So if we could do as many as possible of our G0 collections at that point, we'd avoid "accidental copying" of objects that are actually short-lived into the older generation (which should reduce pressure on that older generation, as well as speed up G0 collection).

Ideally we'd have some way of telling the GC to try to avoid running during the actual frame itself, too, by for example tuning the heap region sizes automatically.
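The closest thing available today is unconditional (a sketch; and as comes up below, System.Mem.performGC is a full major collection, so this is exactly the blunt instrument that wants refining):

    import Control.Monad (forever)
    import System.Mem (performGC)

    frameLoop :: IO () -> IO ()
    frameLoop renderFrame = forever $ do
      -- Start of frame: most of the last frame's temporaries are now
      -- garbage, so collect here rather than at a random mid-frame point.
      performGC
      renderFrame

-- Sebastian Sylvan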

On 01/03/2010 17:16, Sebastian Sylvan wrote:
Presumably a non-trivial portion of the objects copied from the nursery/G0 are actually short-lived objects that just happened to have their short life-span overlap with the collection. So really, copying them to the next generation is a "mistake" in some sense.
There's a technique we use that mitigates the effect of this to some extent, called "aging". The idea is that an object is only promoted if it survives at least N GCs in the young generation. Typically N=2 is a good value, so an object that is allocated just before a minor GC will be copied, but probably not promoted unless it survives all the way to the next GC. You can also use fractional values of N, and in fact you should measure N in terms of bytes allocated rather than GCs. In GHC 6.12.1 you can tune the amount of aging with +RTS -T<n>, but I removed the option in 6.14 in order to make the GC implementation simpler; we now have a fixed -T2 aging policy. In practice other values were very rarely any better, in fact.
For games, though, we have a very good point that occurs regularly where we know that all/most short-lived objects will no longer be referenced - at the start of a fresh frame.
System.Mem.performGC is your friend, but if you're unlucky it might do a major GC and then you'll get more pause than you bargained for. Cheers, Simon

On Tue, Mar 2, 2010 at 7:17 AM, Simon Marlow wrote:
For games, though, we have a very good point that occurs regularly where we know that all/most short-lived objects will no longer be referenced - at the start of a fresh frame.
System.Mem.performGC is your friend, but if you're unlucky it might do a major GC and then you'll get more pause than you bargained for.
Some fine-grained control might be nice here. E.g. I could do a major GC as a player is opening a menu, on a loading screen, when the game is paused, or at some other key points, and it might still be annoying, but at least it wouldn't interfere with gameplay. There is of course the question of what happens if one of these key points doesn't happen when we need to do an allocation, but... oh well. Perhaps that could be mitigated by saying "I would rather you allocate than major GC right now". Are any of these options impossible, or unreasonably difficult to implement (I don't suspect so)? Luke

On 02/03/10 20:37, Luke Palmer wrote:
Some fine-grained control might be nice here. E.g. I could do a major GC as a player is opening a menu, on a loading screen, when the game is paused, or at some other key points, and it might still be annoying, but at least it wouldn't interfere with gameplay. There is of course the question of what happens if one of these key points doesn't happen when we need to do an allocation, but... oh well. Perhaps that could be mitigated by saying "I would rather you allocate than major GC right now". Are any of these options impossible, or unreasonably difficult to implement (I don't suspect so)?
Actually that's one thing we can do relatively easily, i.e. defer major GC for a while. Due to the way GHC has a two-layer memory manager, the heap is a list of discontiguous blocks, so we can always allocate some more memory. So it would be pretty easy to provide something like

    disableMajorGC, enableMajorGC :: IO ()

Of course leaving it disabled too long could be bad, but that's your responsibility.

Oh, I just checked and System.Mem.performGC actually performs a major GC, here's its implementation:

    foreign import ccall {-safe-} "performMajorGC" performGC :: IO ()

To perform a minor GC (or possibly a major GC if one is due), you want this:

    foreign import ccall {-safe-} "performGC" performMinorGC :: IO ()
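Combined with the frame-boundary idea upthread, a sketch of a loop that takes the cheap collection at each frame start (the foreign import is the one above; renderFrame is whatever your engine does):

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Control.Monad (forever)

    foreign import ccall safe "performGC" performMinorGC :: IO ()

    -- A minor collection (or a major one, if due) at the frame boundary,
    -- where most of the previous frame's allocation is already garbage.
    frameLoop :: IO () -> IO ()
    frameLoop renderFrame = forever $ do
      performMinorGC
      renderFrame

Cheers, Simon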

Is there a similar set of runes to be able to see how much mutation has occurred, how much was live last GC, etc? Neil

On 03/03/2010 08:43, Neil Davies wrote:
[...]
Is there a similar set of runes to be able to see how much mutation has occurred, how much was live last GC, etc?
No, but there could be - the information is collected by the RTS, it just isn't made available via a public API right now. Cheers, Simon

Simon Marlow wrote:
So it would be pretty easy to provide something like
disableMajorGC, enableMajorGC :: IO ()
Of course leaving it disabled too long could be bad, but that's your responsibility.
It seems like it'd be preferable to have an interface like:
withMajorGCDisabled :: IO () -> IO ()
or (for some definition of K):
withMajorGCDisabled :: (K -> IO ()) -> IO ()
in order to ensure that it always gets turned back on eventually. Of course, the latter can be created from the former pair. It's just that the former reminds me a bit much of explicit memory management and how difficult it is to balance the free()s... -- Live well, ~wren

On 05/03/2010 05:03, wren ng thornton wrote:
Simon Marlow wrote:
[...]
It seems like it'd be preferable to have an interface like:
withMajorGCDisabled :: IO() -> IO()
or (for some definition of K):
withMajorGCDisabled :: (K -> IO()) -> IO()
Sure, my intention was that you'd build this with the primitives.
in order to ensure that it always gets turned back on eventually. Of course, the latter can be created from the former pair. It's just that the former reminds me a bit much of explicit memory management and how difficult it is to balance the free()s...
quite! Cheers, Simon
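Given the proposed primitives, the exception-safe wrapper wren asks for is a one-liner over Control.Exception.bracket_. A sketch only — disableMajorGC and enableMajorGC don't exist in GHC's RTS yet, so they're stubbed out here:

import Control.Exception (bracket_)

-- Stubs for the proposed primitives; not in GHC's RTS today.
disableMajorGC, enableMajorGC :: IO ()
disableMajorGC = return ()   -- would ask the RTS to defer major GC
enableMajorGC  = return ()   -- would allow major GC again

-- bracket_ re-enables major GC even if the action throws, which
-- addresses the worry about balancing the free()s.
withMajorGCDisabled :: IO a -> IO a
withMajorGCDisabled = bracket_ disableMajorGC enableMajorGC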

On 2010-03-02 14:17 +0000 (Tue), Simon Marlow wrote:
System.Mem.performGC is your friend, but if you're unlucky it might do a major GC and then you'll get more pause than you bargained for.
Anybody calling that is a really poor unlucky sod, because, as far as I
can tell from reading the sources, System.Mem.performGC always does a
major GC.
cjs
--
Curt Sampson

A fully concurrent GC running on multiple threads/cores might be great, but I guess this is difficult to implement and introduces a lot of overhead.

For simple video games, it might work to always do a full GC per frame, but not allow it to take more than T milliseconds. In a sense the GC function should be able to be paused and resumed, but it could run on the same thread as the game loop, so no synchronization overhead would arise (in a sense many video games are already little cooperative multitasking systems). The game loop should then decide how to balance the time given to the GC and the memory being collected. This would cause longer frame times and hence sometimes a decrease in frame rate, but never long stalls.

Note that the issue also popped up for me in C many years ago. Using Watcom C/C++ in the nineties, I found out that a call to the free function could take a very long time. Also, for games that could run many hours or days (e.g. a multi-player game server) the C heap could get very fragmented, causing memory allocations and deallocations to take ever more time, sometimes even to fail. To make my games run smoothly I had to implement my own memory manager. To make them run for a very long time, I had to implement a memory defragmenter. So in the end, this means a garbage collector :-)

Of course this problem is well known, I just wanted to re-state that for games the typical C heap is not very well suited either.
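To make the budget idea concrete: pausing and resuming a collection halfway through would need RTS support, but measuring what a boundary collection costs against a frame budget can be sketched today. The budget value and function names below are illustrative, and performGC here is still the major-GC variant discussed above:

import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Mem (performGC)

-- Frame-time budget for GC, in seconds (illustrative).
gcBudget :: Double
gcBudget = 0.002

-- Collect at the frame boundary and report how much of the budget
-- was spent; a real game loop would adapt its workload accordingly.
gcAtFrameBoundary :: IO ()
gcAtFrameBoundary = do
  t0 <- getCurrentTime
  performGC
  t1 <- getCurrentTime
  let spent = realToFrac (diffUTCTime t1 t0) :: Double
  putStrLn ("GC took " ++ show spent ++ "s of " ++ show gcBudget ++ "s")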

On Sat, 27 Feb 2010, Luke Palmer wrote:
I have seen some proposals around here for SoC projects and other things to try to improve the latency of GHC's garbage collector. I'm currently developing a game in Haskell, and even 100ms pauses are unacceptable for a real-time game. I'm calling out to people who have seen or made such proposals, because I would be willing to contribute funding and/or mentor a project that would contribute to this goal. Also any ideas for reducing this latency in other ways would be very appreciated.
In my experiments with real-time audio signal processing I could always find a culprit for buffer-underflows other than the garbage collector. Sometimes it was a space leak (e.g. by adding a finalizer to the wrong object), sometimes incorrect handling of time differences, and when working with LLVM it was frequent recompilation of LLVM code.

I am just going to jump on the RT dogpile and mention that I too have wanted RT-friendly GC in Haskell. I have attempted on more than one occasion to implement real-time audio applications in Haskell, but I tend to get a lot of buffer underruns, most likely due to GC.

For audio apps, there is a callback that happens every few milliseconds — as often as every 4ms. The callback needs to finish as quickly as possible to avoid buffer underruns. So, in theory, I believe I would want garbage collection to happen only in the time periods between runs of the callback. This wouldn't affect the total amount of time spent in garbage collection, but it would help minimize the amount of time spent in the callback, which should reduce buffer underruns.

My feeling right now is that the 'best' solution might be something similar to Synthesis OS. I would create a DSL for the realtime DSP code. Using harpy, this DSL would be compiled to assembly with execution-time guarantees (as much as can be predicted on modern hardware). For a 'modular synth' this might actually be better than C code, because the effects of certain choices could be 'compiled in' instead of having to check the setting via a compare (that is what Synthesis OS does). - jeremy
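To make the DSL idea concrete, here is a toy sketch of the shape such a DSL could take — interpreter only; the harpy/assembly backend and any execution-time guarantees are precisely the parts omitted:

-- A tiny signal DSL: a term denotes a sample-indexed computation,
-- so a compiler could specialise away all the constructor dispatch.
data Sig
  = Osc Double        -- sine oscillator at the given frequency (Hz)
  | Gain Double Sig   -- scale a signal
  | Mix Sig Sig       -- sum two signals

-- Interpret one sample at the given sample rate.
sample :: Double -> Sig -> Int -> Double
sample rate sig n = go sig
  where
    t = fromIntegral n / rate
    go (Osc f)    = sin (2 * pi * f * t)
    go (Gain g s) = g * go s
    go (Mix a b)  = go a + go b

-- For example, a 440 Hz tone mixed with a quieter 660 Hz tone:
patch :: Sig
patch = Mix (Osc 440) (Gain 0.3 (Osc 660))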

Jeremy Shaw wrote:
My feeling right now is that the 'best' solution might be something similar to Synthesis OS. [...] (that is what Synthesis OS does).
I'm currently doing this with LLVM - which is more portable and better optimizing than a plain X86 assembler: http://code.haskell.org/synthesizer/llvm/

On 2010-03-02 11:33 +0900 (Tue), Simon Cranshaw wrote:
I can confirm that without tweaking the RTS settings we were seeing over 100ms GC pauses.
Actually, we can't quite confirm that yet. We're seeing large amounts of time go by in our main trading loop, but I'm still building the profiling tools to see what exactly is going on there. However, GC is high on our list of suspects, since twiddling the GC parameters can improve things drastically. On 2010-03-02 00:06 -0500 (Tue), John Van Enk wrote:
Would a more predictable GC or a faster GC be better in your case? (Of course, both would be nice.)
Well, as on 2010-03-01 17:18 -0600 (Mon), Jeremy Shaw wrote:
For audio apps, there is a callback that happens every few milliseconds. As often as 4ms. The callback needs to finish as quickly as possible to avoid a buffer underruns.
I think we're in about the same position. Ideally we'd never have to stop for GC, but that's obviously not practical; what will hurt pretty badly, and should be able to be prevented, is us gathering up a bunch of market data, making a huge pause for a big GC, and then generating orders based on that now-oldish market data. We'd be far better off doing the GC first, and then looking at the state of the market and doing our thing, because though the orders will still not get out as quickly as they would without the GC, at least they'll be using more recent data.

I tried invoking System.Mem.performGC at the start of every loop, but that didn't help. Now that I know it was invoking a major GC, I can see why. :-)

But really, before I go much further with this: On 2010-03-01 14:41 +0100 (Mon), Peter Verswyvelen wrote:
Sounds like we need to come up with some benchmarking programs so we can measure the GC latency and soft-realtimeness...
Exactly. Right now I'm working from logs made by my own logging and
profiling system. These are timestamped, and they're "good enough"
to get a sense of what's going on, but incomplete. I also have the
information from the new event logging system, which is excellent in
terms of knowing exactly when things are starting and stopping, but
doesn't include information about my program, nor does it include any
sort of GC stats. Then we have the GC statistics we get with -S, which
don't have timestamps.
My plan is to bring all of this together. The first step was to fix
GHC.Exts.traceEvent so that we can use that to report information about
what the application is doing. In 6.12.1 it segfaults, but we have a fix
(see http://hackage.haskell.org/trac/ghc/ticket/3874) and it looks as if
it will go into 6.12.2, even. The next step is to start recording the
information generated by -S in the eventlog as well, so that not only
do we know when a GC started or stopped in relation to our application
code, but we know what generation it was, how big the heap was at the
time, how much was collected, and so on and so forth. Someone mentioned
that there were various other stats that were collected but not printed
by -S; we should probably throw those in too.
With all of that information it should be much easier to figure
out where and when GC behaviour is causing us pain in low-latency
applications.
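As a sketch of how that first step would be used once the fix is in: compile with -eventlog, run with +RTS -ls, and wrap application phases in traceEvent calls so they line up with the RTS's GC events in the log. The phase names here are made up:

import GHC.Exts (traceEvent)   -- String -> IO (); segfaults before the #3874 fix

main :: IO ()
main = mapM_ step [1 .. 10 :: Int]
  where
    step n = do
      traceEvent ("frame " ++ show n ++ " start")
      print (sum [1 .. 100000 * n])   -- stand-in for real work
      traceEvent ("frame " ++ show n ++ " end")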
However: now that Simon's spent a bunch of time experimenting with the
runtime's GC settings and found a set that's mitigated much of our
problem, other things are pushing their way up my priority list. Between
that and an upcoming holiday, I'm probably not going to get back to this
for a few weeks. But I'd be happy to discuss my ideas with anybody else
who's interested in similar things, even if just to know what would be
useful to others.
What do you guys think about setting up a separate mailing list for
this? I have to admit, I don't follow haskell-cafe much due to the high
volume of the list. (Thus my late presence in this thread.) I would be
willing to keep much closer track of a low-volume list that dealt with
only GC stuff.
I'd even be open to providing hosting for the list, using my little baby
mailing list manager written in Haskell (mhailist). It's primitive, but
it does handle subscribing, unsubscribing and forwarding of messages.
cjs
--
Curt Sampson

On 04/03/2010 09:14, Curt Sampson wrote:
However: now that Simon's spent a bunch of time experimenting with the runtime's GC settings and found a set that's mitigated much of our problem, other things are pushing their way up my priority list. [...]
What settings are you using, out of interest?
What do you guys think about setting up a separate mailing list for this? [...]
Please feel free to discuss on glasgow-haskell-users@ which is lower volume. I don't particularly like setting up new mailing lists unless there's likely to be an ongoing need - we have quite a few defunct lists on haskell.org now.
I'd even be open to providing hosting for the list, using my little baby mailing list manager written in Haskell (mhailist). [...]
Sure, that would be fine too. Cheers, Simon

For settings we are using -N7 -A8m -qg.
I don't know if they are really the optimal values but I haven't found a
significant improvement on these yet. I tried -qb but that was slow. I
tried larger values of A but that didn't seem to make a big difference. Also
-N6 didn't make much difference. Specifying H values didn't seem to make
much difference. I have to admit I don't fully understand the implications
of the values and was just experimenting to see what worked best. Please
let me know if there's anything you suggest I should try. That said, we
have already got it to a speed where it's working, so I'm OK with what we
have now.
On Fri, Mar 5, 2010 at 9:38 PM, Simon Marlow wrote:
What settings are you using, out of interest?

On 06/03/10 06:56, Simon Cranshaw wrote:
For settings we are using -N7 -A8m -qg.
I'm surprised if turning off parallel GC improves things, unless you really aren't using all the cores (ThreadScope will tell you that). Do these flags give you an improvement in throughput, or just pause times?
I don't know if they are really the optimal values but I haven't found a significant improvement on these yet. I tried -qb but that was slow.
Interesting, I often find that -qb improves things.
I tried larger values of A but that didn't seem to make a big difference.
-A8m is close to the size of your L2 caches, right? That will certainly be better than the default of -A512k.
Also -N6 didn't make much difference. Specifying H values didn't seem to make much difference.
-H is certainly a mixed bag when it comes to parallel programs.
I have to admit I don't fully understand the implications of the values and was just experimenting to see what worked best.
So the heap size is trading off locality (cache hits) against GC time. The larger the heap, the fewer GCs you do, but the worse the locality. Usually I find keeping the nursery size (-A) close to the L2 cache size works best, although sometimes making it really big can be even better.

-qg disables parallel GC completely. This is usually terrible for locality, because every GC will move all the recently allocated data from each CPU's L2 cache into the cache of the CPU doing GC, where it will have to be fetched out again after GC.

-qb disables load-balancing in the parallel GC, which improves locality at the expense of parallelism; usually I find it is an improvement in parallel programs.

Cheers, Simon
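For anyone following along: these are runtime flags, so for a program built with -threaded the invocation looks something like (program name invented):

./prog +RTS -N7 -A8m -qb -S -RTS

where -S additionally prints the per-GC statistics discussed earlier in the thread.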

On 2010-03-06 12:42 +0000 (Sat), Simon Marlow wrote:
Usually I find keeping the nursery size (-A) close to the L2 cache size works best, although sometimes making it really big can be even better.
Interesting to know. I got the impression that I was being encouraged to keep -A closer to the L1 cache size, myself.
-qg disables parallel GC completely. This is usually terrible for locality, because every GC will move all the recently allocated data from each CPU's L2 cache into the cache of the CPU doing GC, where it will have to be fetched out again after GC.
I've since explained to Cranshaw (we are getting to have *way* too many 'Simon's around here) about the issues with our different machines; some of this depends on the host on which we're doing the testing.

* Our Core i7 hosts share 8 MB of L3 cache amongst four cores with two threads each. Thus, no locality penalties here.

* Our Xeon E5420 host has two 4-core CPUs, and each pair of cores shares a 6 MB L2 cache. Thus there's a pretty good chance that something you need is in someone else's cache. I don't know if there's any difference between moving stuff between two caches on the same CPU and two caches on different CPUs.

* Our Xeon E5520 host has two 4-core CPUs, each core of which has two threads. Each CPU (4 cores) shares an 8 MB L3 cache. Thus, presumably, less locality penalty than the E5420 but more than an i7. As a side note, I also see slightly less memory bandwidth on this system (for both caches and main memory) than I do on an i7.

This gets complex pretty fast. And don't ask me about Intel's new style of having L1 and L3 or L2 and L3 caches rather than L1 and L2 caches.
-qb disables load-balancing in the parallel GC, which improves locality at the expense of parallelism, usually I find it is an improvement in parallel programs.
I'd think so too. Figuring out what went on here is going to have to
wait until I get more detailed GC information in the eventlog.
Followups to glasgow-haskell-users@haskell.org.
cjs
--
Curt Sampson
participants (23)
- Andrew Coppin
- Curt Sampson
- Derek Elkins
- Don Stewart
- Heinrich Apfelmus
- Henning Thielemann
- Henning Thielemann
- Jason Dagit
- Jason Dusek
- Jeremy Shaw
- Job Vranish
- John Van Enk
- Luke Palmer
- Malcolm Wallace
- Neil Davies
- Pavel Perikov
- Peter Verswyvelen
- Sebastian Sylvan
- Simon Cranshaw
- Simon Marlow
- Sönke Hahn
- Thomas Schilling
- wren ng thornton