RE: Segfaulting programs with GHC 6.4.1

On 22 October 2005 03:25, John Goerzen wrote:
Here's some more data.
I tried this program with three versions of GHC.
GHC 6.4 with forkIO: could not produce a segfault at all
GHC 6.4 with forkOS: rapid segfault (<30 sec)
GHC 6.4.1 with forkIO: segfault in <5 min
GHC 6.4.1 with forkOS: segfault in <1 min
GHC 6.5: same results as GHC 6.4.1
Now, I also obtained a backtrace from GHC 6.4 with forkOS. Here's a bit of it:
#0  0x080c2e3e in ForeignziCziString_zdwpeekCAString_info ()
#1  0xb7a9e868 in ?? ()
#2  0x00000020 in ?? ()
#3  0xb7ecb08c in _res () from /lib/tls/libc.so.6
#4  0xb7ad80ec in ?? ()
#5  0x00000004 in ?? ()
#6  0x016b0d36 in ?? ()
#7  0xb7ac3f9c in ?? ()
#8  0x93c0cef9 in ?? ()
#9  0x4359a1e1 in ?? ()
#10 0x000742d3 in ?? ()
#11 0x00000000 in ?? ()
....
#2021 0x00000000 in ?? ()
#2022 0xb7ed80a6 in __pthread_mutex_cond_lock () from /lib/tls/libpthread.so.0
#2023 0x080c2f4c in ForeignziCziString_zdwpeekCAString_info ()
Now, this is the first time something from pthread has actually appeared, which is interesting.
The code surrounding that CAString_info is:
0x080c2f44:  and    %eax,(%eax)
0x080c2f46:  add    %al,(%eax)
0x080c2f48:  and    $0x0,%al
0x080c2f4a:  add    %al,(%eax)
0x080c2f4c:  movl   $0x80c2f10,0x0(%ebp)
0x080c2f53:  jmp    *(%esi)
0x080c2f55:  nop
0x080c2f56:  nop

Hi John,

Thanks for trying to narrow this down. At this stage it looks like some kind of heap corruption. Can you reproduce it on more than one machine? (we have to rule out hardware failure, it's happened before and can cost a lot of debugging time).

BTW, the stack trace isn't useful beyond the top element, the reason being that GHC's virtual machine doesn't use the C stack. gdb can usually tell you which bit of code the crash happened in, but if it was in Haskell code then you don't have any information from the C stack.

I'll need to reproduce it here. Can you give me a set of instructions to get me up to the right point?

Cheers,
Simon

On Mon, Oct 24, 2005 at 10:53:48AM +0100, Simon Marlow wrote:
Hi John,
Thanks for trying to narrow this down. At this stage it looks like some kind of heap corruption. Can you reproduce it on more than one machine?
Yes, though it is not nearly as easy. I cannot really explain that. I suspect it could have something to do with the order of data coming from the DB (it's unordered) or system load or something else along those lines.

Here's another odd thing: the binaries built on the two systems are not quite identical, even though, as far as I can tell, everything about the build environment is identical (Debian sid). One is a few K larger than the other, and I can't figure out why.

Both are fairly new, nice workstations from HP. I've had no trouble like this with any other program on either, and this isn't the first task like this in either place. Also, it seems that the binary produced on one is more prone to crash than that produced on the other. But it could be my imagination.
(we have to rule out hardware failure, it's happened before and can cost a lot of debugging time).
I'll need to reproduce it here. Can you give me a set of instructions to get me up to the right point?
Here goes. Reminder, my test environment is Linux x86, ghc 6.4.1:

1. Install PostgreSQL 8.0. You can get this with most Linux distros, or from www.postgresql.org.

2. As your PostgreSQL user (usually you may need to su to postgres), run:

   createuser smarlow
   createdb smarlow
   createlang plpgsql smarlow

   (In this and following steps, replace "smarlow" with your Linux username, if it's not "smarlow")

3. Download http://www.complete.org/~jgoerzen/dump.bz2 (7.7MB)

4. Back as your normal smarlow user, run:

   bzcat dump.bz2 | sed 's/ jgoerzen/ smarlow/' > dump.sql

   (spaces and quotes are important there; unpacks to 190MB)

   psql -f dump.sql -U smarlow smarlow

   There will be four errors at the beginning that you can ignore. ("must be owner of schema public", 2x "permission denied for language c", "must be superuser to create procedural language")

   This will probably take a few minutes to run. I think it will take up about 500MB of disk space once loaded.

5. Install prerequisites. You will need HSQL 1.6 and the HSQL PostgreSQL module, plus MissingH 0.12.1 from http://http.us.debian.org/debian/pool/main/m/missingh/missingh_0.12.1.tar.gz . Both are cabalized.

6. Now, get the code.

   darcs get http://darcs.complete.org/gopherbot
   ghc --make -o setup Setup.lhs
   ./setup configure
   ./setup build

7. Create the directory /home/jgoerzen/tree/gopher-arch on your system, making sure that your smarlow user has read access to it. (The data stored in the DB, as well as a config, references that path for now. Sorry.)

8. Adjust these settings in your postgresql.conf, making sure to remove the existing values, if any:

   shared_buffers = 3000
   sort_mem = 4000
   maintenance_work_mem = 96000
   work_mem = 64000
   fsync = off
   checkpoint_segments = 12
   effective_cache_size = 8000

   And then restart the PostgreSQL server.

9. Now run dist/build/gopherbot. You should see it start to download documents, and crash after a few minutes.

If you have trouble connecting, adjust the first empty string on line 42 of DB.hs to match unix_socket_directory in your postgresql.conf.

The settings made in step 8 make PostgreSQL much faster. Without them, it is hard to make the program crash.

The program will use about 500MB RAM while running. It will take about 10 minutes to get up to speed. (It takes a bit to load its worklist from PostgreSQL, and to eliminate some dead hosts.) After that, it'll start up quicker, and run fast.

I'll also keep trying to gather data here.
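
On the remark in step 4 that the spaces and quotes matter, here is a minimal sketch of what that sed expression does. It assumes the dump names the owner after a space. The `rename_owner` helper, the table names, and the sample lines are made up for illustration and are not taken from dump.bz2. Because the pattern begins with a space, a path such as /home/jgoerzen/tree/gopher-arch is left alone, which would fit with step 7 asking for that directory to exist.

```shell
#!/bin/sh
# Hypothetical helper wrapping the sed expression from step 4.
# $1 is the local username that should replace "jgoerzen".
# Assumes the username contains no "/" or "&", which sed would
# treat specially in the replacement text.
rename_owner() {
    # Double quotes so that $1 expands. The leading space in the
    # pattern means only " jgoerzen" (preceded by a space) matches.
    # There is no "g" flag, so only the first match per line changes.
    sed "s/ jgoerzen/ $1/"
}

# Illustrative input only; these lines are not from the real dump.
rename_owner smarlow <<'EOF'
ALTER TABLE files OWNER TO jgoerzen;
COPY config (path) FROM stdin;
/home/jgoerzen/tree/gopher-arch
EOF
```

With this sample input, the first line should come out ending in "OWNER TO smarlow;" while the path line passes through unchanged, since "jgoerzen" there follows a slash and not a space.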