
On Mon, Oct 24, 2005 at 10:53:48AM +0100, Simon Marlow wrote:
> Hi John,
> Thanks for trying to narrow this down. At this stage it looks like some kind of heap corruption. Can you reproduce it on more than one machine?
Yes, though it is not nearly as easy on the second machine, and I cannot really explain that. I suspect it could have something to do with the order of data coming from the DB (it's unordered), or system load, or something else along those lines.

Here's another odd thing: the binaries built on the two systems are not quite identical, even though, as far as I can tell, everything about the build environment is identical (Debian sid). One is a few KB larger than the other, and I can't figure out why. Both machines are fairly new, nice workstations from HP. I've had no trouble like this with any other program on either one, and this isn't the first task of this kind I've run on either. Also, it seems that the binary produced on one is more prone to crash than the binary produced on the other, but that could be my imagination.
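One way to pin down how the two builds differ is to compare them directly. The following is only a sketch: `compare_builds` is a hypothetical helper name, and the two paths are whatever copies of the binaries you have brought onto one machine.

```shell
#!/bin/sh
# Hypothetical helper: report whether two builds of the same program differ.
# Usage: compare_builds path/to/binary-A path/to/binary-B
compare_builds() {
    a=$1
    b=$2
    # Byte size of each file (wc -c is POSIX; tr strips padding some
    # implementations add).
    size_a=$(wc -c < "$a" | tr -d ' ')
    size_b=$(wc -c < "$b" | tr -d ' ')
    echo "size A: $size_a bytes"
    echo "size B: $size_b bytes"
    # cmp -s prints nothing and exits 0 only if the files are byte-identical.
    if cmp -s "$a" "$b"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

If the files differ, comparing the output of `ldd` on each binary, and of `ghc-pkg list` on each machine, may show whether a library or package version is the cause. That is a guess at where to look, not a diagnosis.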
> (we have to rule out hardware failure, it's happened before and can cost a lot of debugging time).
> I'll need to reproduce it here. Can you give me a set of instructions to get me up to the right point?
Here goes. Reminder, my test environment is Linux x86, ghc 6.4.1:

1. Install PostgreSQL 8.0. You can get this with most Linux distros, or from www.postgresql.org.

2. As your PostgreSQL user (usually you will need to su to postgres), run:

     createuser smarlow
     createdb smarlow
     createlang plpgsql smarlow

   (In this and following steps, replace "smarlow" with your Linux username, if it's not "smarlow")

3. Download http://www.complete.org/~jgoerzen/dump.bz2 (7.7MB)

4. Back as your normal smarlow user, run:

     bzcat dump.bz2 | sed 's/ jgoerzen/ smarlow/' > dump.sql

   (spaces and quotes are important there; unpacks to 190MB)

     psql -f dump.sql -U smarlow smarlow

   There will be four errors at the beginning that you can ignore. ("must be owner of schema public", 2x "permission denied for language c", "must be superuser to create procedural language")

   This will probably take a few minutes to run. I think it will take up about 500MB of disk space once loaded.

5. Install prerequisites. You will need HSQL 1.6 and the HSQL PostgreSQL module, plus MissingH 0.12.1 from http://http.us.debian.org/debian/pool/main/m/missingh/missingh_0.12.1.tar.gz . Both are cabalized.

6. Now, get the code.

     darcs get http://darcs.complete.org/gopherbot
     ghc --make -o setup Setup.lhs
     ./setup configure
     ./setup build

7. Create the directory /home/jgoerzen/tree/gopher-arch on your system, making sure that your smarlow user has read access to it. (The data stored in the DB, as well as a config, references that path for now. Sorry.)

8. Adjust these settings in your postgresql.conf, making sure to remove the existing values, if any:

     shared_buffers = 3000
     sort_mem = 4000
     maintenance_work_mem = 96000
     work_mem = 64000
     fsync = off
     checkpoint_segments = 12
     effective_cache_size = 8000

   And then restart the PostgreSQL server.

9. Now run dist/build/gopherbot. You should see it start to download documents, and crash after a few minutes.
If you have trouble connecting, adjust the first empty string on line 42 of DB.hs to match unix_socket_directory in your postgresql.conf.

The settings made in step 8 make PostgreSQL much faster. Without them, it is hard to make the program crash.

The program will use about 500MB RAM while running. It will take about 10 minutes to get up to speed. (It takes a bit to load its worklist from PostgreSQL, and to eliminate some dead hosts.) After that, it'll start up quicker, and run fast.

I'll also keep trying to gather data here.
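To check whether one binary really crashes more often than the other, rather than going by impression, the runs could be counted. This is a rough sketch: `count_failures` is a hypothetical helper, and it assumes the program either exits on its own or crashes, so each run eventually ends.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and count non-zero exits.
# Usage: count_failures N command [args...]
# Output of each run goes to run-1.log, run-2.log, ... in the current directory.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        "$@" > "run-$i.log" 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            failures=$((failures + 1))
            # By shell convention a status above 128 means the process was
            # killed by a signal (for example 139 = 128 + 11, SIGSEGV).
            echo "run $i: exit status $status"
        fi
    done
    echo "$failures of $runs runs failed"
}
```

For example, `count_failures 5 dist/build/gopherbot` on each machine would give two numbers to compare, along with the logs from each run.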