
I've been trying to profile my program and having some trouble. I imagine other people have stumbled across these same issues, so hopefully I can benefit from someone else's experience. In return, if I can figure out some of these things, I'll put it up at http://www.haskell.org/haskellwiki/How_to_profile_a_Haskell_program or maybe make make a heap profiling page linked from that. Firstly a few miscellaneous questions: When running '-hc -hblag,drag' it works for a little while and then stops. The app is still hung, but cpu has gone to 0%. The disk is also idle, so I don't think it's swapping. According to -S, all garbage collection has stopped too. It's apparently due to something about this particular profile, since reducing the amount of data it handles just results in a sooner hang. This same combination works with other profiles, so apparently something the code is doing is locking up. Has anyone else seen this? Any tips on how to troubleshoot where it's getting stuck, doing what? If it sounds like a ghc bug I can try to trim down the size and submit a ticket. GHC 6.12.1 on OSX. The image link from http://www.haskell.org/ghc/docs/6.12.2/html/users_guide/profiling.html is broken, which makes it a little harder to understand the documentation. -s stats say GC time is 46%, productivity is 32%. That's pretty bad, right? And where is the remaining 22% going? The ghc manual says lag is "time between creation and first use" and drag is "time between last use and last reference is dropped". I think I understand drag: it means you haven't touched the data for a while before it goes out of scope or its retainer goes out of scope. So for instance: stuff = do st <- get x1 st x2 Does this mean that 'st' will be dragging through 'x2' as it would in an imperative language? I gather the usual haskell use is when combined with laziness: 'Map.insertWith (+) "a" 42' will keep the old value of "a" around until you look "a" up and force it, since the (+) won't be applied until then. Same with foldl. Are there some other classic examples of drag? Searching for "haskell dragging problem" doesn't yield much to do with memory use. Lag I'm not so sure about. How is something created before it's used? And... what's INHERENT_USE? And what about VOID? How can an object be created but never used? What triggered its creation? So, the main question: I have a program that runs some computation in a monad stack before extracting the final result, a list, and passing it on. When run under the heap profiler, there's a huge spike in that phase, which I think should be mostly bogus usage, since the final output is so relatively small. When I run under -hb I see big bands for LAG and DRAG. According to -hd the top 3 users are: mtl-1.1.0.2:Control.Monad.Writer.Lazy.sat_sltc (,) D:Monad This is kind of puzzling to me... first of all I've never seen an explanation for sat_* closure descriptors, and second of all, why does it not show up in the .prof file at all? I switched to Writer.Strict and the drag disappeared, which helped, but the lag is still there, and the top -hd is now mtl-1.1.0.2:Control.Monad.Writer.Strict.sat_soa1 State stg_ap_2_upd_info (the top -hy is "*", which I gather means "don't know"). And what's "stg_ap_2_upd_info"? The top item accounts for 70% of the memory usage. One obvious candidate for the lag is Writer's data (DList Log.Msg) is collecting and only being forced at the end of the computation, but there is no logging in this section and in any case should never be 30M of it! -hc is not helpful since every monadic operation is charged a little bit, -hr is similarly unhelpful (top retainer is MANY... hah). So what exactly is this sat_*? Where is the memory going? I guess it doesn't have an SCC since it doesn't show up in the .prof output. Is there some place I can manually put an SCC? I was able to fix the drag just by guessing at a strict writer, but the lag is still around. Is there another spot in Writer's >>= that could be accumulating? What's *in* that giant mountain of lag?

On 16/06/2010 01:52, Evan Laforge wrote:
I've been trying to profile my program and having some trouble. I imagine other people have stumbled across these same issues, so hopefully I can benefit from someone else's experience. In return, if I can figure out some of these things, I'll put it up at http://www.haskell.org/haskellwiki/How_to_profile_a_Haskell_program or maybe make make a heap profiling page linked from that.
Firstly a few miscellaneous questions:
When running '-hc -hblag,drag' it works for a little while and then stops. The app is still hung, but cpu has gone to 0%. The disk is also idle, so I don't think it's swapping. According to -S, all garbage collection has stopped too. It's apparently due to something about this particular profile, since reducing the amount of data it handles just results in a sooner hang. This same combination works with other profiles, so apparently something the code is doing is locking up. Has anyone else seen this? Any tips on how to troubleshoot where it's getting stuck, doing what? If it sounds like a ghc bug I can try to trim down the size and submit a ticket. GHC 6.12.1 on OSX.
Please submit a ticket, and try 6.12.3 if you can (we did fix some deadlock bugs in 6.12.2 and 6.12.3).
The image link from http://www.haskell.org/ghc/docs/6.12.2/html/users_guide/profiling.html is broken, which makes it a little harder to understand the documentation.
-s stats say GC time is 46%, productivity is 32%. That's pretty bad, right? And where is the remaining 22% going?
The ghc manual says lag is "time between creation and first use" and drag is "time between last use and last reference is dropped". I think I understand drag: it means you haven't touched the data for a while before it goes out of scope or its retainer goes out of scope. So for instance:
stuff = do st<- get x1 st x2
Does this mean that 'st' will be dragging through 'x2' as it would in an imperative language?
Probably not. It depends on what x1 does with st of course, but assuming when x1 returns all references to st have been dropped, then st is no longer reachable and will not be retained by the GC.
I gather the usual haskell use is when combined with laziness: 'Map.insertWith (+) "a" 42' will keep the old value of "a" around until you look "a" up and force it, since the (+) won't be applied until then. Same with foldl. Are there some other classic examples of drag? Searching for "haskell dragging problem" doesn't yield much to do with memory use.
The original paper about this should help: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219
Lag I'm not so sure about. How is something created before it's used?
oh, that happens a lot. e.g. in f (g x), if f doesn't demand the value of its argument for a long time, then the heap closure for (g x) is lagging. If the value of (g x) is never demanded, but is nevertheless stored in some data structure for a while, then it is VOID.
And... what's INHERENT_USE?
Primitive objects like arrays and mutable variables, where we don't record the use time so we consider them to be implicitly used.
So, the main question:
I have a program that runs some computation in a monad stack before extracting the final result, a list, and passing it on. When run under the heap profiler, there's a huge spike in that phase, which I think should be mostly bogus usage, since the final output is so relatively small. When I run under -hb I see big bands for LAG and DRAG.
According to -hd the top 3 users are:
mtl-1.1.0.2:Control.Monad.Writer.Lazy.sat_sltc (,) D:Monad
This is kind of puzzling to me... first of all I've never seen an explanation for sat_* closure descriptors, and second of all, why does it not show up in the .prof file at all?
sat_sltc is just a compiler-generated name for a thunk (a suspended computation). It doesn't show up in the .prof file because it isn't a cost centre, it's the name for a closure and is only shown by -hd.
I switched to Writer.Strict and the drag disappeared, which helped, but the lag is still there, and the top -hd is now
mtl-1.1.0.2:Control.Monad.Writer.Strict.sat_soa1 State stg_ap_2_upd_info
The Monad instance for Writer looks like this: instance (Monoid w) => Monad (Writer w) where return a = Writer (a, mempty) m >>= k = Writer $ case runWriter m of (a, w) -> case runWriter (k a) of (b, w') -> (b, w `mappend` w') I expect sat_s0a1 is the closure for (w `mappend` w'). If that is causing your space leak, then maybe you need a Control.Monad.Writer.Stricter in which the written value is forced strictly by >>=.
(the top -hy is "*", which I gather means "don't know"). And what's "stg_ap_2_upd_info"? The top item accounts for 70% of the memory usage.
You see "*" when the type of a closure is polymorphic (as it would be in the case of w `mappend` w'). stg_ap_2_upd_info is a generic thunk implementation in the RTS - perhaps when profiling we should avoid using these to give you more information. Cheers, Simon
One obvious candidate for the lag is Writer's data (DList Log.Msg) is collecting and only being forced at the end of the computation, but there is no logging in this section and in any case should never be 30M of it! -hc is not helpful since every monadic operation is charged a little bit, -hr is similarly unhelpful (top retainer is MANY... hah).
So what exactly is this sat_*? Where is the memory going? I guess it doesn't have an SCC since it doesn't show up in the .prof output. Is there some place I can manually put an SCC? I was able to fix the drag just by guessing at a strict writer, but the lag is still around. Is there another spot in Writer's>>= that could be accumulating? What's *in* that giant mountain of lag? _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

locking up. Has anyone else seen this? Any tips on how to troubleshoot where it's getting stuck, doing what? If it sounds like a ghc bug I can try to trim down the size and submit a ticket. GHC 6.12.1 on OSX.
Please submit a ticket, and try 6.12.3 if you can (we did fix some deadlock bugs in 6.12.2 and 6.12.3).
Ok, looks like it's still happening, so I suppose a ticket will be in order. I'm going to try to cut it down a little first though. At the least it's 100% reproduceable. I can try to get a stack trace with gdb or the OS X process tracing feature. In once instance I got a crash with no message (not even segfault or sigabrt). OS X traceback says: Thread 2 Crashed: 0 elaforge.seq.seq 0x02a2d74d LDV_recordDead + 301 The profiling files (.prof, .hp, etc.) are all empty.
Does this mean that 'st' will be dragging through 'x2' as it would in an imperative language?
Probably not. It depends on what x1 does with st of course, but assuming when x1 returns all references to st have been dropped, then st is no longer reachable and will not be retained by the GC.
I suppose static analysis can figure out pretty easily that 'st' is not referenced even though it's in scope, so the closure doesn't count as a retainer.
The original paper about this should help:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219
Thanks, just read it now. I still don't have an intuitive feel for lag and drag, but sometimes I need to read something a few times for it to sink in.
Lag I'm not so sure about. How is something created before it's used?
oh, that happens a lot. e.g. in f (g x), if f doesn't demand the value of its argument for a long time, then the heap closure for (g x) is lagging. If the value of (g x) is never demanded, but is nevertheless stored in some data structure for a while, then it is VOID.
Ok... so a (g x) closure is constructed, and it will keep 'x' alive when it shouldn't, but what's the actual data constructor here? If 'x' is [1, 2], will these two cons cells be counted as being in LAG stage? Or is it just the closure, which will show up as a sat_something?
The Monad instance for Writer looks like this:
instance (Monoid w) => Monad (Writer w) where return a = Writer (a, mempty) m >>= k = Writer $ case runWriter m of (a, w) -> case runWriter (k a) of (b, w') -> (b, w `mappend` w')
I expect sat_s0a1 is the closure for (w `mappend` w'). If that is causing your space leak, then maybe you need a Control.Monad.Writer.Stricter in which the written value is forced strictly by >>=.
So I copied Writer/Strict.hs and made a new one with a slightly stricter >>=, note the ! on w': m >>= k = WriterT $ do (a, w) <- runWriterT m (b, !w') <- runWriterT (k a) let ws = w `mappend` w' return (b, ws) Initially this had no effect. I put a ! on the first 'w', and now the StricterWriter.sat_* closure is replaced by the #2 consumer, which is PAP. What *is* PAP? Memory usage is exactly the same, I just have more PAP now. So I tried switching from DList and DList.append to [] and (++) and now... the lag is gone! Only the ! on w' is needed. The quadratic behaviour of (++) is not a practical worry in my case, since I log rarely so most of the appends are with []s on one side or both, but this is a bit troubling. Writer + DList was specifically recommended as an efficient way to log things, but if a giant space leak is the result, then it's not such a great idea after all. I'm guessing it has something to do with how DList is implemented in terms of function compositions, but since I can't see inside PAP I have to guess. 'xs ++ []', when forced, can be immediately reduced to 'xs'. The bang forces the right hand side to either [] or (_ : _) which is enough for (++) to reduce out of existence... I suppose? Meanwhile, 'append' for DList is (xs . ys) which becomes ((xs++) . ([]++)), and forcing that... well I suppose it has to get to a constructor to pattern match on, so that should force the ([]++) out of existence since it reduces to the same code as in the plain list part... or does it? To try to get a better picture of what was going on, I created a simpler profile with a simplified monad stack: just logging and state and doing some stylized actions... and now the results are the reverse! The stricter version on lists produces an enormous amount of lag, the stock Writer.Strict is much better, but still produces quite a bit of garbage (perhaps from the State, though I can't see what I'm doing wrong there). The dlist + normal strictness version is so fast the gc can't even measure it! But if I bump the 'depth' from 9 to 10 (which means factorially more calls), suddenly I'm getting stack overflows... but initially only in hb mode. So the more I look at it, the less I understand. Clearly something is different from my production profile and the test profile, but it seems baffling that it makes such a difference as to reverse the results. I will try to reduce the production version element by element until the results switch around. I put the simple version at http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26329#a26329 This one displays much better performance with DList + Writer.Strict than List + StrictWriter so I guess it's not too surprising. However, *something* is still generating a lot of lag and -hb causes a stack trace, so I'm still doing it wrong somehow. I also have a situation where -hb shows about 4mb of drag at the most, but when I run with '-hc -hbdrag', I only see 10k peaks. Shouldn't filtering the graph by drag add up to as much drag as when the graph isn't filtered?
You see "*" when the type of a closure is polymorphic (as it would be in the case of w `mappend` w'). stg_ap_2_upd_info is a generic thunk implementation in the RTS - perhaps when profiling we should avoid using these to give you more information.
Well, it's eating a lot of space... so it must be something. I guess if it's impossible to tell what it is, then it's just being needlessly confusing there. And thanks so much for the responses... space usage gets more confusing the more I look at it, but I hope it will come clear eventually given enough time.

On 18/06/2010 08:29, Evan Laforge wrote:
oh, that happens a lot. e.g. in f (g x), if f doesn't demand the value of its argument for a long time, then the heap closure for (g x) is lagging. If the value of (g x) is never demanded, but is nevertheless stored in some data structure for a while, then it is VOID.
Ok... so a (g x) closure is constructed, and it will keep 'x' alive when it shouldn't, but what's the actual data constructor here? If 'x' is [1, 2], will these two cons cells be counted as being in LAG stage? Or is it just the closure, which will show up as a sat_something?
I think it depends whether the [1,2] has been "used" or not. If it has been used, then the closures are in the USE state until the last use, otherwise they are in the LAG state.
The Monad instance for Writer looks like this:
instance (Monoid w) => Monad (Writer w) where return a = Writer (a, mempty) m>>= k = Writer $ case runWriter m of (a, w) -> case runWriter (k a) of (b, w') -> (b, w `mappend` w')
I expect sat_s0a1 is the closure for (w `mappend` w'). If that is causing your space leak, then maybe you need a Control.Monad.Writer.Stricter in which the written value is forced strictly by>>=.
So I copied Writer/Strict.hs and made a new one with a slightly stricter>>=, note the ! on w':
m>>= k = WriterT $ do (a, w)<- runWriterT m (b, !w')<- runWriterT (k a) let ws = w `mappend` w' return (b, ws)
Initially this had no effect. I put a ! on the first 'w', and now the StricterWriter.sat_* closure is replaced by the #2 consumer, which is PAP. What *is* PAP? Memory usage is exactly the same, I just have more PAP now. So I tried switching from DList and DList.append to [] and (++) and now... the lag is gone! Only the ! on w' is needed.
The quadratic behaviour of (++) is not a practical worry in my case, since I log rarely so most of the appends are with []s on one side or both, but this is a bit troubling. Writer + DList was specifically recommended as an efficient way to log things, but if a giant space leak is the result, then it's not such a great idea after all. I'm guessing it has something to do with how DList is implemented in terms of function compositions, but since I can't see inside PAP I have to guess.
Right, I wouldn't use DList for this. Here's an alternative I use: data AList a = ANil | ASing a | Append (AList a) (AList a) lenA :: AList a -> Int lenA ANil = 0 lenA (ASing _) = 1 lenA (Append l r) = lenA l + lenA r appendA ANil r = r appendA l ANil = l appendA l r = Append l r Note how appendA is strict(ish) and eliminates ANils, so in a writer monad it shouldn't build up a space leak. I'm sure you can write toList (don't use (++) though). I'd put the bang on ws: m>>= k = WriterT $ do (a, w) <- runWriterT m (b, w') <- runWriterT (k a) let !ws = w `mappend` w' return (b, ws) The problem with this monad is that >>= isn't tail-recursive, so it will cause stack overflows on recursive monadic functions. I suspect that a better alternative to the strict writer monad is the strict state monad in most cases, because its bind is tail-recursive.
To try to get a better picture of what was going on, I created a simpler profile with a simplified monad stack: just logging and state and doing some stylized actions... and now the results are the reverse! The stricter version on lists produces an enormous amount of lag, the stock Writer.Strict is much better, but still produces quite a bit of garbage (perhaps from the State, though I can't see what I'm doing wrong there). The dlist + normal strictness version is so fast the gc can't even measure it! But if I bump the 'depth' from 9 to 10 (which means factorially more calls), suddenly I'm getting stack overflows... but initially only in hb mode.
So the more I look at it, the less I understand. Clearly something is different from my production profile and the test profile, but it seems baffling that it makes such a difference as to reverse the results. I will try to reduce the production version element by element until the results switch around.
I put the simple version at http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26329#a26329
This one displays much better performance with DList + Writer.Strict than List + StrictWriter so I guess it's not too surprising. However, *something* is still generating a lot of lag and -hb causes a stack trace, so I'm still doing it wrong somehow.
I also have a situation where -hb shows about 4mb of drag at the most, but when I run with '-hc -hbdrag', I only see 10k peaks. Shouldn't filtering the graph by drag add up to as much drag as when the graph isn't filtered?
That sounds suspicious. If you can make a self-contained example that demonstrates it and create a ticket, that would be a great help. Cheers, Simon
You see "*" when the type of a closure is polymorphic (as it would be in the case of w `mappend` w'). stg_ap_2_upd_info is a generic thunk implementation in the RTS - perhaps when profiling we should avoid using these to give you more information.
Well, it's eating a lot of space... so it must be something. I guess if it's impossible to tell what it is, then it's just being needlessly confusing there.
And thanks so much for the responses... space usage gets more confusing the more I look at it, but I hope it will come clear eventually given enough time.

Right, I wouldn't use DList for this. Here's an alternative I use:
data AList a = ANil | ASing a | Append (AList a) (AList a)
lenA :: AList a -> Int lenA ANil = 0 lenA (ASing _) = 1 lenA (Append l r) = lenA l + lenA r
appendA ANil r = r appendA l ANil = l appendA l r = Append l r
Note how appendA is strict(ish) and eliminates ANils, so in a writer monad it shouldn't build up a space leak. I'm sure you can write toList (don't use (++) though).
I hope you're not overestimating me :) I thrashed around for a good long time trying to get (++) to associate to the right, and then stumbled across OrdList in ghc, which is apparently just this data structure, with the addition of 'Many [a]'.
I'd put the bang on ws:
m>>= k = WriterT $ do (a, w) <- runWriterT m (b, w') <- runWriterT (k a) let !ws = w `mappend` w' return (b, ws)
The problem with this monad is that >>= isn't tail-recursive, so it will cause stack overflows on recursive monadic functions. I suspect that a better alternative to the strict writer monad is the strict state monad in most cases, because its bind is tail-recursive.
Bang on 'ws' was my first instinct too, but it appeared to have no effect. I tried to work out a manual reduction for >>= to figure out exactly what is being forced when, but I got a headache and decided to do something more relaxing, like watch Primer. Yes, I suppose Writer's >>= isn't tail recursive... which seems like a disaster for >>=, since they are usually composed into very long chains. I know non-tail recursiveness isn't necessarily a disaster in haskell because of laziness, but it seems like you need just one strict monad in the stack, like ErrorT or IO and you are no longer lazy. I implemented LoggerT as a strict state monad, and I think it got better, but there's still a lot of garbage coming from somewhere. AppendList actually seems slower in the face of repeated appends than (:) followed by reverse, as I mentioned in the other email. Or it could be I'm not measuring things right...
I also have a situation where -hb shows about 4mb of drag at the most, but when I run with '-hc -hbdrag', I only see 10k peaks. Shouldn't filtering the graph by drag add up to as much drag as when the graph isn't filtered?
That sounds suspicious. If you can make a self-contained example that demonstrates it and create a ticket, that would be a great help.
Ok, I'll try to reduce this to a manageable size. It's the same code that produces the deadlock while profiling, so perhaps that's related. thanks again for the advice!

I put the simple version at http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26329#a26329
This one displays much better performance with DList + Writer.Strict than List + StrictWriter so I guess it's not too surprising. However, *something* is still generating a lot of lag
Some lag is to be expected: logging is about creating and recording data for later use. You'd need to annotate your code to measure which parts of lag are not attributable to the log (for instance, your "m" is passed around a lot before it gets used). Then you can use "-hblag -hc -L40" to get lag attributed to cost centers. Btw, you probably want to be more strict about things like that sum in the simple version ("return $! (1+x1+..)"), even a stricter base case ("return $! 1") helps a little, suggesting that you might want to think about the strictness of your logging tools before you put that in production. Hth, Claus

On Sat, Jun 19, 2010 at 8:46 AM, Claus Reinke
I put the simple version at http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26329#a26329
This one displays much better performance with DList + Writer.Strict than List + StrictWriter so I guess it's not too surprising. However, *something* is still generating a lot of lag
Some lag is to be expected: logging is about creating and recording data for later use. You'd need to annotate
Indeed, though I don't think this is the case, because I get lots of lag even when no logs are written. I solved the stack overflow problem with state by replacing 'modify' with a stricter '!x <- get; put $! (x+1)'. However, the results from profiling logging continue to be mysterious. The fastest is using a strict StateT, prepending with 'modify (msg:)' and reversing the list on run. Oddly, making *this* modify strict results in a near 2x slowdown, perhaps due to a 27.7% productivity instead of a 31.8% productivity (though both are terrible, and I still don't know why all the garbage is being produced). Using Simon's AppendList (my implementation is based on OrdList from ghc), it's slightly slower than the strict modify version. Apparently reversing a 4194304 element list is still faster than using a custom datatype. My conclusion is that my contrived tests are not useful because the whole point is to get a controlled situation by simplifying it, but even the simple version displays mysterious behaviour. I should go back to poking at the production version. Even if I don't understand the changes I'm making, at least when I get a speedup I'm more sure it translates to a real speed up in the application.

On Sat, Jun 19, 2010 at 8:46 AM, Claus Reinke
wrote: I put the simple version at http://hpaste.org/fastcgi/hpaste.fcgi/view?id=26329#a26329
This one displays much better performance with DList + Writer.Strict than List + StrictWriter so I guess it's not too surprising. However, *something* is still generating a lot of lag
Some lag is to be expected: logging is about creating and recording data for later use. You'd need to annotate
Indeed, though I don't think this is the case, because I get lots of lag even when no logs are written.
In the part you deleted I mentioned one source of lag that does not disappear when no logs are written, and a way of using profiling cost centers to track down other sources (the ones I mentioned accounted for 2/3 of lag in the "profile_logging" profile, according to "-hblag -hc -L40" on SCC-annotated source).
My conclusion is that my contrived tests are not useful because the whole point is to get a controlled situation by simplifying it, but even the simple version displays mysterious behaviour.
It is sometimes difficult to produce a simplified version that has strictly fewer problems than the full version, especially if the simple version is not derived from the full version but from simplifying assumptions. Nevertheless, the simpler version has less code, and it is therefore helpful to try and understand its behaviour before returning to full complexity. In particular, it is important to figure out which problematic aspects of the simple version also apply to the full version and which problems can be ignored because they cannot occur in the full version.
I should go back to poking at the production version. Even if I don't understand the changes I'm making, at least when I get a speedup I'm more sure it translates to a real speed up in the application.
That way lies madness. Not recommended unless you are producing code that noone is ever going to look at again, and even then you'll be writing similar code in the future, so it is better to put in the energy once to puzzle out those mysterious details!-) Claus

Indeed, though I don't think this is the case, because I get lots of lag even when no logs are written.
In the part you deleted I mentioned one source of lag that does not disappear when no logs are written, and a way of using profiling cost centers to track down other sources (the ones I mentioned accounted for 2/3 of lag in the "profile_logging" profile, according to "-hblag -hc -L40" on SCC-annotated source).
You mean !s on the intermediate numbers? I could have sworn I tried that but no luck. Thanks for reminding me about manual SCC pragmas, somehow I totally forgot you could add your own. Just out of curiosity, what affect could "return $! 1" have? A constant should never be a thunk, so 'seq' on it should have no effect, right?
I should go back to poking at the production version. Even if I don't understand the changes I'm making, at least when I get a speedup I'm more sure it translates to a real speed up in the application.
That way lies madness. Not recommended unless you are producing code that noone is ever going to look at again, and even then you'll be writing similar code in the future, so it is better to put in the energy once to puzzle out those mysterious details!-)
Yes, you are right of course, and thank you for the help. It's just frustrating when every step toward simpler instead brings new problems out of the woodwork... but I suppose with enough experience I can begin to understand those too. I will resume my testing when I get some time again!

Indeed, though I don't think this is the case, because I get lots of lag even when no logs are written. In the part you deleted I mentioned one source of lag that does not disappear when no logs are written, and a way of using profiling cost centers to track down other sources (the ones I mentioned accounted for 2/3 of lag in the "profile_logging" profile, according to "-hblag -hc -L40" on SCC-annotated source).
You mean !s on the intermediate numbers?
No, I wrote about your 'm' parameter to 'runprof', which gets passed all the way through the recursion tree of run_lots before it finally gets used at the leaf level. If I recall correctly, I put one SCC on the log constructors and another on the actual parameter passed to runprof. The comments on strictness were less about lag than about total memory usage - unevaluated computations piling up in logs or recursion parameters are a frequent source of out of memory issues (from dramatic slowdown to termination). Even for your simple test, that changes the profile (the graphic is scaled, but the numbers matter).
I could have sworn I tried that but no luck. Thanks for reminding me about manual SCC pragmas, somehow I totally forgot you could add your own. Just out of curiosity, what affect could "return $! 1" have? A constant should never be a thunk, so 'seq' on it should have no effect, right?
I was surprised by that myself. In principle, numeric constants translate to calls to fromInteger (Haskell 98 report, 3.2), and can return anything with a Num instance (such as the literal '1' translating into an infinite list of ones), so they can't simply be pre-evaluated. But you did specify the type to be Int, so the compiler could have avoided the thunk (one would need to look at GHC's output to check). Then again, we were looking at lag, not at heap size, and those 1s (small, but lots of them) were built before they are used.
Yes, you are right of course, and thank you for the help. It's just frustrating when every step toward simpler instead brings new problems out of the woodwork... but I suppose with enough experience I can begin to understand those too. I will resume my testing when I get some time again!
Understood. You don't have to understand everything at once, and simplified test setups often have issues that don't occur in the real application. But analyzing real applications needs some intuition about what might be going on, and that is easier to acquire in simple test setups. There, you can apply the old Holmes maxime of debugging: once you've excluded everything that can't be the problem, what is left, however unlikely or hard to understand, has to be it (with apologies to Doyle;-). Claus

On Wed, Jun 16, 2010 at 2:52 AM, Evan Laforge
-s stats say GC time is 46%, productivity is 32%. That's pretty bad, right? And where is the remaining 22% going?
That is indeed awfully bad. First, you should look out for the case where you create a lot of data and then quickly consume it afterwards, because it increases GC pressure. You are spending 2/3 of your time doing something else than running the mutator (i.e., the program). For biographical profiling, it might take some time in the profiling areas, so look at the counts in -s which tells you the elapsed time in the profiling innards. You should take a look at the retainer profiling as a common problem I've run into many times is lazy thunks retaining a lot of memory. If it is retainers, you either need to make your code strict or annotate it as such. Finally, you can tune the eden/nursery generation by providing +RTS -A5m for instance. The default is 0.5 megabytes which I think is way too low for most work. Increasing it blindly will hurt though as you will hit L1/L2 cache limits and get worse performance. My application has had problems with both retainer profiling and (especially) biographical profiling - but whenever I try to make a watered-down reproducible example the problem disappears. If you do manage the get a reproducible bug going, I am interested in that ticket.
So, the main question:
I have a program that runs some computation in a monad stack before extracting the final result, a list, and passing it on. When run under the heap profiler, there's a huge spike in that phase, which I think should be mostly bogus usage, since the final output is so relatively small. When I run under -hb I see big bands for LAG and DRAG.
I would look at retainers. Or if you hold on to things you don't use anymore rather than dropping the reference. In general, the heap profiler is the way to go for removing space leak problems - -- J.

On 16/06/2010 14:04, Jesper Louis Andersen wrote:
Finally, you can tune the eden/nursery generation by providing +RTS -A5m for instance. The default is 0.5 megabytes which I think is way too low for most work. Increasing it blindly will hurt though as you will hit L1/L2 cache limits and get worse performance.
Not sure I follow here - you say that 0.5MB is too low, and yet increasing it will hurt performance (I agree with you on the last point: in my experience most of the time increasing it does tend to make things worse). OTOH there are programs where the 0.5MB default really hurts - the binary-trees benchmark on the shootout, for example. If we could identify a useful set of heuristics to decide when it's a good idea to increase the nursery size, that would help. I think the only way to get a real handle on this is to put together a comprehensive benchmark suite and test lots of different programs with different heap requirements, and on lots of different machines with different cache sizes. Incedentally, GHC HEAD has a new flag +RTS -H, which increases the nursery size but without increasing the overall memory use (basically it uses the space left over after a copying major collection for the nursery). This sometimes helps with programs that have large heaps, but not always. Cheers, Simon
My application has had problems with both retainer profiling and (especially) biographical profiling - but whenever I try to make a watered-down reproducible example the problem disappears. If you do manage the get a reproducible bug going, I am interested in that ticket.
So, the main question:
I have a program that runs some computation in a monad stack before extracting the final result, a list, and passing it on. When run under the heap profiler, there's a huge spike in that phase, which I think should be mostly bogus usage, since the final output is so relatively small. When I run under -hb I see big bands for LAG and DRAG.
I would look at retainers. Or if you hold on to things you don't use anymore rather than dropping the reference. In general, the heap profiler is the way to go for removing space leak problems -
participants (4)
-
Claus Reinke
-
Evan Laforge
-
Jesper Louis Andersen
-
Simon Marlow