
Hi Greg,
Gregory Wright wrote:
Some data and a few questions:
1. The failure on FreeBSD is not the same as on OS X. I built 6.4.2 from CVS on FreeBSD 6.1, and ran the ghc-regress tests. The tests took a long time to run (about 14 hours on a dual Xeon 2.8 GHz with 2 GB of memory). Towards the end of the tests, there were about 30 "timeout" processes running, apparently doing nothing but consuming CPU cycles.
Ok, this is certainly a problem with forkOS in the threaded RTS in 6.4.2 on FreeBSD. I probably need to get access to a FreeBSD box to fix this myself, the code is pretty delicate (and sadly it has completely changed in 6.6, too). It might be worth trying with -lthr instead of -lpthread, according to Robert Watson. This switches to an alternative, 1:1, threading library.
2. Notes on reproducing the FreeBSD 6.4.2 build: I used
fpconfig from the ghc-6-4 branch; ghc, libraries, hslibs and testsuite from the ghc-6-4-2 branch; GNU make 3.80; autoconf 2.59.
GNU make 3.81 went into an infinite loop, much as GNU make 3.79 did when building ghc on OS X.
That's odd, the fix for make 3.79 is in the 6.4.2 tree (rev. 1.82.2.2 of mk/suffix.mk). Something else must be happening with 3.81, sigh.
3. Did the threaded RTS work on 6.4.1? Was it used by default?
Presumably not. In 6.4.2 we switched to using the threaded RTS by default for GHC itself, which has forced the problem to the surface. Also there were some changes to the timeout program in the testsuite, which have apparently forced some other problems to the surface.
I can provide an RTS thread listing (+RTS -Ds) if that would be a starting point. Someone would have to explain what it means to me, though.
4. When running with debugging turned on, I have seen the assertion failure
ghc-6.4.2: internal error: ASSERTION FAILED: file GC.c, line 4356
Please report this as a compiler bug. See:
http://www.haskell.org/ghc/reportabug
This points toward the stack being corrupted. Maybe a thread overflowing its stack? I'm not sure. The assertion that fails is
ASSERT(frame < bottom);
It looks as if something has messed up the stack before this.
Ok, it would help to find a smaller program that crashes with -threaded: debugging GHC itself is quite hard because it's difficult to get a deterministic run and hence reproducibility. Look at your testsuite failures and find threaded failures that aren't due to the compiler crashing (or just build stage2 without -threaded and run the testsuite again). Tests in concurrent/ are a good bet. When we have a smallish program that crashes, we can start debugging.
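As a rough idea of the kind of smallish program that might be worth trying, here is a minimal sketch that forks a batch of bound threads with forkOS and waits for each one to report back. It is not taken from the testsuite; the module layout, the name runWorkers and the thread count of 50 are all made up for illustration, and there is no guarantee it triggers the same failure.

```haskell
-- Hypothetical stress test, not part of the GHC testsuite.
-- Assumed build command: ghc -threaded ForkOSStress.hs
import Control.Concurrent (forkOS)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Fork n bound (OS) threads. Thread i computes sum [1..i] and
-- hands the result back through its own MVar.
runWorkers :: Int -> IO [Int]
runWorkers n = do
  boxes <- forM [1 .. n] $ \i -> do
    box <- newEmptyMVar
    _ <- forkOS (putMVar box $! sum [1 .. i])
    return box
  -- Block until every worker has delivered its result.
  mapM takeMVar boxes

main :: IO ()
main = do
  results <- runWorkers 50
  print (length results)
```

If this hangs or crashes when linked with -threaded, raising the thread count or running it in a loop may make the failure easier to reproduce.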
I am willing to dig into this, but I need a bit more help with where to start.
Thanks for your help!
Cheers,
Simon