I am having difficulty imagining how there could be a loop that evaluates a black hole a second time by the same thread. The paper Haskell on a Shared-Memory Processor mentions that the runtime must not do this. My take on this description is that it should never happen. Even if code was recursive, it should not happen. Code might recurse forever and chew up memory until failure, but a thread creating a black hold should not trip up on its own black hole.

So I am not sure what I am looking for. There is no recursion in the encode.

Can you give some examples of what I should be looking for?

Mike


On Nov 11, 2014, at 8:37 PM, John Lato <jwlato@gmail.com> wrote:

The blocked on black hole message is very suspicious.  It means that thread 7 is blocked waiting for another thread to evaluate a thunk.  But in this case, it's thread 7 that created that thunk and is supposed to be doing the evaluating.  This is some evidence that Gregory's theory is correct and your encode function loops somewhere.

On Wed Nov 12 2014 at 11:25:30 AM Michael Jones <mike@proclivis.com> wrote:
Gregory,

The options in the 7.8.3 user guide says in the -Msize option that by default the heap is unlimited. I have several applications, and they all have messages like:

7fddc7bcd700: cap 2: waking up thread 7 on cap 2
7fddc7bcd700: cap 2: thread 4 stopped (yielding)
7fddcaad6740: cap 2: running thread 7 (ThreadRunGHC)
7fddcaad6740: cap 2: thread 7 stopped (heap overflow)
7fddcaad6740: cap 2: requesting parallel GC
7fddc5ffe700: cap 0: starting GC
7fddc57fd700: cap 1: starting GC
7fdda77fe700: cap 3: starting GC
7fddcaad6740: cap 2: starting GC

I assumed that when the heap ran out of space, it caused a GC, or it enlarged the heap. The programs that have these messages run for very long periods of time, and when I heap profile them, they use about 500KM to 1MB over long periods of time, and are quite stable.

As a test, I ran the hang application with profiling to see if memory jumps up before or after the hang.

What I notice is the app moves along using about 800KB, then there is a spike to 2MB at the hang. So I believe you, but I am confused about the RTS behavior and how I can have all these overflow messages in a normal application and how to tell the difference between these routine messages vs a real heap problem.

So, I dug deeper into the log. A normal execution for sending a command looks like:

7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 7 stopped (heap overflow)
7f99e6ffd700: cap 0: requesting parallel GC
7f99e6ffd700: cap 0: starting GC
7f99e6ffd700: cap 0: GC working
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: all caps stopped for GC
7f99e6ffd700: cap 0: finished GC
7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: sendCommand
7f99e6ffd700: cap 0: sendCommand: encoded
7f99e6ffd700: cap 0: sendCommand: size 4
7f99e6ffd700: cap 0: sendCommand: unpacked
7f99e6ffd700: cap 0: Sending command of size 4
7f99e6ffd700: cap 0: Sending command of size "\NUL\EOT"
7f99e6ffd700: cap 0: sendCommand: sent
7f99e6ffd700: cap 0: sendCommand: flushed
7f99e6ffd700: cap 0: thread 7 stopped (blocked on an MVar)
7f99e6ffd700: cap 0: running thread 2 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 2 stopped (yielding)
7f99e6ffd700: cap 0: running thread 45 (ThreadRunGHC)
7f99e6ffd700: cap 0: fetchTelemetryServer
7f99e6ffd700: cap 0: fetchTelemetryServer: got lock

The thread is run, overflows, GC, runs, then blocks on an MVAr.

For a the hang case:

7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: sendCommand
7f99e6ffd700: cap 0: thread 7 stopped (heap overflow)
7f99e6ffd700: cap 0: requesting parallel GC
7f99e6ffd700: cap 0: starting GC
7f99e6ffd700: cap 0: GC working
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: all caps stopped for GC
7f99e6ffd700: cap 0: finished GC
7f9a05362a40: cap 0: running thread 1408 (ThreadRunGHC)
7f9a05362a40: cap 0: thread 1408 stopped (yielding)
7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 7 stopped (heap overflow)
7f99e6ffd700: cap 0: requesting parallel GC
7f99e6ffd700: cap 0: starting GC
7f99e6ffd700: cap 0: GC working
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: GC idle
7f99e6ffd700: cap 0: GC done
7f99e6ffd700: cap 0: all caps stopped for GC
7f99e6ffd700: cap 0: finished GC
7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 7 stopped (yielding)
7f99e6ffd700: cap 0: running thread 2 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 2 stopped (yielding)
7f99e6ffd700: cap 0: running thread 45 (ThreadRunGHC)
7f99e6ffd700: cap 0: fetchTelemetryServer
7f99e6ffd700: cap 0: fetchTelemetryServer: got lock
...
7f99e6ffd700: cap 0: fetchTelemetryServer: got lock
7f99e6ffd700: cap 0: fetchTelemetryServer: unlock
7f99e6ffd700: cap 0: fetchTelemetryServer
7f99e6ffd700: cap 0: fetchTelemetryServer: got lock
7f99e6ffd700: cap 0: fetchTelemetryServer: unlock
7f99e6ffd700: cap 0: fetchTelemetryServer
7f99e6ffd700: cap 0: thread 45 stopped (yielding)
7f9a05362a40: cap 0: running thread 1408 (ThreadRunGHC)
7f9a05362a40: cap 0: thread 1408 stopped (suspended while making a foreign call)
7f9a05362a40: cap 0: running thread 1408 (ThreadRunGHC)
7f9a05362a40: cap 0: thread 1408 stopped (suspended while making a foreign call)
7f99e6ffd700: cap 0: running thread 7 (ThreadRunGHC)
7f99e6ffd700: cap 0: thread 7 stopped (blocked on black hole owned by thread 7)
7f9a05362a40: cap 0: running thread 1408 (ThreadRunGHC)
7f9a05362a40: cap 0: thread 1408 stopped (suspended while making a foreign call)

run, overflow, GC, run, overflow, GC, run, yield, other stuff, run blocked on black hole

And that is the last activity for thread 7.

I found a few links about black holes and such, so I’ll go off and read those and try to learn what the RTS is doing and why it can hang on a black hole. But if you have some hits, let me know. 

Mike



On Nov 11, 2014, at 4:01 PM, Gregory Collins <greg@gregorycollins.net> wrote:


On Tue, Nov 11, 2014 at 2:06 PM, Michael Jones <mike@proclivis.com> wrote:
Those are all over the log even when it runs properly. So I assume the runtime is resizing the heap or something.

No, it means you're exhausting the heap (maybe the runtime stack for the thread running "encode"), probably because "encode" is infinite-looping. I think Occam's razor applies here, check that any recursion you're doing is actually reducing the recursive argument. Perhaps you could post the code (e.g. http://gist.github.com/)?

G


--
Gregory Collins <greg@gregorycollins.net>

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users