
On Tue, Feb 26, 2013 at 12:21:40PM -0500, Brandon Allbery wrote:
On Mon, Feb 25, 2013 at 5:10 PM, Zev Weiss wrote:
For the record, in case anyone else happens to encounter this -- it was pointed out to me by a helpful individual off-list that this is actually a known problem when running binaries mmapped out of AFS, where my xmonad binary happens to reside. I've changed my xsession script to run it out of a local filesystem instead and am no longer seeing this behavior.
Can you give me any more information about this? Simply running executables out of AFS does not have any known issues; if it did, Carnegie Mellon University (my previous employer) would have run headlong into it long since, and it would have been fixed by now.
This is a problem I have been annoyed by for a few years now, and I've had limited success in tracking it down. The problem doesn't affect all binaries, seemingly just Haskell binaries, and it gets worse with larger Haskell binaries.

The problem seems to be related to the state of the AFS cache somehow. Just after a reboot with a cold cache, I have to run ghc (some of my GHC installs are on AFS) 5+ times in a row to get it to do anything besides die with a SIGBUS. The same goes for pandoc. After the binary starts up properly the first time, it seems to be in cache and doesn't act up until it gets kicked out of cache.

Here is an old cafe thread where I tried to track this down. Not many other people reported the problem, but those who did seemed resigned to it:

https://groups.google.com/forum/?fromgroups=#!searchin/haskell-cafe/tristan$...

That post highlights a separate but seemingly related problem. There, GHC fails when it hits some TH code and has to load a few libraries off of disk during compilation. I don't know exactly what the ghci linker does there, but it is prepping that code for execution and explodes if the libraries it is loading are not in cache. In those cases, I have to keep running 'cabal install', and ghc keeps making forward progress, loading a few more libraries successfully each time. Eventually they are all in cache and it works.

My guess is that the problem is some bad interaction between whatever the GHC RTS does for file IO and AFS, but it is hard to figure out where to start looking. I have never gotten a useful backtrace in any of these crashes. Most applications don't have any problems, so I imagine it has to be GHC somehow. That said, I've seen some similar crashes in non-Haskell code if a program is using shared libraries that live on AFS. If some application eats all of your memory and caches start getting evicted, sometimes those applications with AFS-based shared libraries explode in a similar way.
Any insight would definitely be appreciated, since this annoys me a few times a day.
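
For anyone who wants to automate the "keep rerunning it until it starts" workaround described above, here is a minimal Python sketch. It assumes a POSIX system, where a child killed by a signal is reported with a negative return code. The helper name run_with_sigbus_retry and its runner parameter are hypothetical, made up for illustration. It only papers over the symptom and does nothing about the underlying cache problem.

```python
import signal
import subprocess


def run_with_sigbus_retry(argv, max_attempts=10, runner=subprocess.run):
    """Run argv, rerunning it for as long as the child dies from SIGBUS.

    Returns (returncode, attempts_used). Hypothetical helper for
    illustration; `runner` exists only so the logic can be exercised
    without spawning a real process.
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = runner(argv).returncode
        # On POSIX, subprocess reports death by signal N as returncode -N.
        if returncode != -signal.SIGBUS:
            return returncode, attempt
    # Every attempt died with SIGBUS; give up and report the last result.
    return returncode, max_attempts
```

Something like run_with_sigbus_retry(["ghc", "--version"]) could then stand in for rerunning the command by hand. Any other exit status, including an ordinary failure, is returned immediately rather than retried.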