
On 15/06/2010 20:43, braver wrote:
On Jun 15, 6:27 am, Simon Marlow wrote:
On 15/06/2010 06:09, braver wrote:
In fact, the tag cafe2, when run on the full dataset, gets stuck at 11 days, with RAM slowly getting into 50 GB; a previous version caused ghc 6.12.1 to segfault around day 12 -- -debug showing an assert failure in Storage.c. ghc 6.10 got stuck at 30 days for good, and when profiling crashed twice with a "strange closure" or a stack overflow. So allocation is a problem still.
I'd be happy to help you track this down, but I don't have a machine big enough. Do you have any runs that display a problem with a smaller heap (< 16GB)?
If the program is apparently hung, try connecting to it with 'gdb --pid=<pid>' and doing 'info thread' and 'where'. That might give me enough clues to find out where the problem is.
Is this with -threaded, BTW? With residency on that scale, I'd expect the parallel GC to help quite a lot. But obviously getting it to not crash/hang is the first priority :)
Simon - thanks for the tips, this is what gdb says when it's stuck at 45 GB when limited with -A5G -M40G:
...
0x00000000004c3c21 in free_mega_group ()
(gdb) info thread
* 1 Thread 0x2b21c1da4dc0 (LWP 10210) 0x00000000004c3c21 in free_mega_group ()
(gdb) where
#0 0x00000000004c3c21 in free_mega_group ()
#1 0x00000000004c3ff9 in freeChain ()
#2 0x00000000004c5ab0 in GarbageCollect ()
#3 0x00000000004bff96 in scheduleDoGC ()
#4 0x00000000004c0b25 in scheduleWaitThread ()
#5 0x00000000004bea09 in real_main ()
#6 0x00000000004beb17 in hs_main ()
#7 0x00000037d5a1d974 in __libc_start_main () from /lib64/libc.so.6
#8 0x0000000000402ca9 in _start ()
Thanks. I don't see anything obviously wrong in free_mega_group() - it's part of the memory manager that returns a multi-MB block to the internal free list, and it looks down the free list to find the right place to put it, coalescing with adjacent free blocks if possible. If it is looping here, that means the free list has a cycle, which is very bad indeed.

Could you try a few more things for me?

- type 'display /i $pc' and then single step with 'si' for a while when it is in this state. That will tell us whether it's looping here or not.

- compile with -debug and run again. That turns on a bunch of assertions. You could also try adding +RTS -DS, this turns on more sanity checking (and will slow things down a lot).

If you are comfortable giving me a login on your machine then I could debug it directly, let me know.

Cheers,
Simon
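
For readers following along, the behaviour described above (walk an address-sorted free list, insert the block, coalesce with its neighbours) can be sketched roughly as below. This is a simplified illustration only, not the GHC RTS code: the Block struct, insert_free() and has_cycle() are hypothetical names invented for this sketch. It also shows why a cycle in the list is fatal: the search loop in insert_free() has no other exit if every node on the cycle sorts before the block being freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical free-list node: a run of `size` contiguous units
 * beginning at address `start`. Not the RTS's real descriptor. */
typedef struct Block {
    size_t start;
    size_t size;
    struct Block *next;
} Block;

/* Insert blk into an address-sorted free list, merging it with the
 * following and preceding blocks when they are adjacent.
 * Returns the (possibly new) head of the list. */
static Block *insert_free(Block *head, Block *blk) {
    assert(blk != NULL);
    Block *prev = NULL, *cur = head;

    /* Find the first node that starts at or after blk. If the list
     * contains a cycle of lower-addressed nodes, this never ends. */
    while (cur != NULL && cur->start < blk->start) {
        prev = cur;
        cur = cur->next;
    }

    /* Coalesce with the following block if it is adjacent. */
    if (cur != NULL && blk->start + blk->size == cur->start) {
        blk->size += cur->size;
        blk->next = cur->next;
    } else {
        blk->next = cur;
    }

    /* Coalesce with the preceding block if it is adjacent. */
    if (prev != NULL && prev->start + prev->size == blk->start) {
        prev->size += blk->size;
        prev->next = blk->next;
        return head;
    }

    if (prev == NULL) {
        return blk;          /* blk becomes the new head */
    }
    prev->next = blk;
    return head;
}

/* Floyd's tortoise-and-hare check: true if the list loops back on
 * itself. A sanity check of this kind would catch the corruption. */
static bool has_cycle(const Block *head) {
    const Block *slow = head, *fast = head;
    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}
```

Freeing blocks [0,2) and [4,6) and then [2,4) with insert_free() leaves a single block of size 6, since the last block merges with both neighbours. Whether the real free list is corrupted in this particular way is exactly what the 'si' stepping and the -debug / -DS runs are meant to establish.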