RE: Segfaulting programs with GHC 6.4.1

On 21 October 2005 13:27, John Goerzen wrote:
It wasn't from my build tree, but:
jgoerzen@katherina:/usr/lib/ghc-6.4.1$ grep -ri s34n_info * Binary file libHSbase.a matches Binary file libHSdata.a matches
I unpacked libHSbase.a and found:
$ grep -ri s34n_info . Binary file ./String__1.o matches Binary file ./Posix__6.o matches (this one doesn't match if I omit -i)
Identical results from libHSdata.a.
I thought I would also look at s3eu_info. It too is in libHSbase.a:
$ grep -r s3eu_info . Binary file ./Internals__19.o matches Binary file ./String__1.o matches
objdump -x String__1.o yields, among other things: ... SYMBOL TABLE: 00000000 l d .text 00000000 .text 00000068 l O .text 0000000c s34n_info 00000190 l O .text 00000008 s3et_info 000000c4 l O .text 00000008 s351_info 00000160 l O .text 00000008 s3eu_info 00000000 l d .data 00000000 .data 00000000 l d .bss 00000000 .bss 00000000 g O .data 00000004 ForeignziCziString_zdwpeekCAString_closure 0000000c g O .text 0000000c ForeignziCziString_zdwpeekCAString_info 00000000 *UND* 00000000 GHCziBase_Izh_con_info 00000000 *UND* 00000000 GHCziBase_Czh_con_info 00000000 *UND* 00000000 GHCziBase_ZC_con_info 00000000 *UND* 00000000 stg_gc_ut 00000000 *UND* 00000000 GHCziBase_ZMZN_closure
Now, I may be totally reading this wrong, but seeing several references to Foreign and String made me think of CStrings. That made me suspicious of this function. You may remember I asked about it on IRC, and we thought it was OK:
msg :: String -> IO () msg l = do t <- myThreadId let disp = (show t) ++ ": " ++ l ++ "\n" withCStringLen disp (\(c, len) -> hPutBuf stdout c len >> hFlush stdout)
This may be a complete red herring, but I just thought I'd bring it up. Maybe it's a clue anyway.
Here's the disassemble output:
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1407190096 (LWP 31315)] 0x080ba84a in s34n_info () (gdb) disassemble s34n_info Dump of assembler code for function s34n_info: 0x080ba834
: mov %edi,0x8(%esp) 0x080ba838 : mov %edi,%eax 0x080ba83a : add $0x8,%eax 0x080ba83d : mov %eax,%edi 0x080ba83f : cmp 0x5c(%ebx),%eax 0x080ba842 : ja 0x80ba86b 0x080ba844 : mov 0x0(%ebp),%edx 0x080ba847 : mov 0x4(%esi),%eax 0x080ba84a : cmpb $0x0,(%eax,%edx,1)
the crash is here ^^^ it does appear to be something string-related, because that instruction isn't a common one in GHC-generated code - comparing a byte against zero, with a non-trivial addressing mode, looks very much like a string traversal. Do you have any other CString-related code in your program? Cheers, Simon

On Fri, Oct 21, 2005 at 01:40:38PM +0100, Simon Marlow wrote:
0x080ba84a
: cmpb $0x0,(%eax,%edx,1) the crash is here ^^^ it does appear to be something string-related, because that instruction isn't a common one in GHC-generated code - comparing a byte against zero, with a non-trivial addressing mode, looks very much like a string traversal.
Do you have any other CString-related code in your program?
No. But my program uses HSQL, which has it *all over* the place. So many uses, in fact, that I have really no idea where to begin tracking it down. I wonder if the bug is in HSQL or GHC? -- John

If I remember exactly you said that your program was working before movement to ghc 6.4.1. If that is the case then I doubt that the problem is in HSQL. In any case the easy way to catch that is to add some trace messages in HSQL. At least traces at the beginning and at the end of query, execute and getColValue functions should be enough. Note that I have never tried HSQL with the multithreaded RTS. Cheers, Krasimir

On Fri, Oct 21, 2005 at 03:17:34PM +0200, Krasimir Angelov wrote:
If I remember exactly you said that your program was working before movement to ghc 6.4.1. If that is the case then I doubt that the problem is in HSQL. In any case the easy way to catch that is to add some trace messages in HSQL. At least traces at the beginning and at
Unfortunately, this has been elusive so far, perhaps due to the multithreaded nature of the program. It appears to be crashing at different places different times -- after calling fetch one time, in the middle of query the next. I am trying to narrow it down a bit by reducing the number of threads and hoping I can still duplicate the problem in a reasonable amount of time. I should also add that I'm putting these debug statements in the PostgreSQL module, but I had observed these segfaults previously when I was using Sqlite3. I switched from sqlite3 to postgresql both for performance reasons, and to try to narrow down this problem. It helped with the performance, anyway ;-) -- John

On Fri, Oct 21, 2005 at 03:17:34PM +0200, Krasimir Angelov wrote:
If I remember exactly you said that your program was working before movement to ghc 6.4.1. If that is the case then I doubt that the problem is in HSQL. In any case the easy way to catch that is to add some trace messages in HSQL. At least traces at the beginning and at the end of query, execute and getColValue functions should be enough. Note that I have never tried HSQL with the multithreaded RTS.
Let me just give an update on this. I have continued to add debug information but still cannot get a reliable crash point. I have also switched to forkIO in an attempt to narrow down the problem. It didn't appear to make any difference. gb175 I've seen some crashes in HSQL code appear to happen before the 244 in this snippet: getColValue pRes tupleIndex countTuples colNumber (name,sqlType,nullable) f = do ldebug "getcolvalue entered" index <- readMVar tupleIndex ldebug "244" when (index >= countTuples) (throwDyn SqlNoData) ldebug "246" (so, on the readMVar line.) Others appeared to occur after the 261 here: ldebug "259" case mb_value of Just v -> do ldebug "261" return (Just v) I'll let you know if I can find out anything more.

Here's some more data.
I tried this program with three versions of GHC.
GHC 6.4 with forkIO: could not produce a segfault at all
GHC 6.4 with forkOS: rapid segfault (<30 sec)
GHC 6.4.1 with forkIO: segfault in <5 min
GHC 6.4.1 with forkOS: segfault in <1 min
GHC 6.5: same results as GHC 6.4.1
Now, I also obtained a backtrace from GHC 6.4 with forkOS. Here's a
bit of it:
#0 0x080c2e3e in ForeignziCziString_zdwpeekCAString_info ()
#1 0xb7a9e868 in ?? ()
#2 0x00000020 in ?? ()
#3 0xb7ecb08c in _res () from /lib/tls/libc.so.6
#4 0xb7ad80ec in ?? ()
#5 0x00000004 in ?? ()
#6 0x016b0d36 in ?? ()
#7 0xb7ac3f9c in ?? ()
#8 0x93c0cef9 in ?? ()
#9 0x4359a1e1 in ?? ()
#10 0x000742d3 in ?? ()
#11 0x00000000 in ?? ()
....
#2021 0x00000000 in ?? ()
#2022 0xb7ed80a6 in __pthread_mutex_cond_lock () from /lib/tls/libpthread.so.0
#2023 0x080c2f4c in ForeignziCziString_zdwpeekCAString_info ()
now this is the first time something from pthread actually appeared,
which is interesting.
The code surrounding that CAString_info is:
0x080c2f44

On Fri, Oct 21, 2005 at 09:24:44PM -0500, John Goerzen wrote:
Here's some more data.
I tried this program with three versions of GHC.
GHC 6.4 with forkIO: could not produce a segfault at all
Apparently I was wrong. This one now did produce a segfault after about 20 minutes of running. I will try to get some information. -- John
participants (3)
-
John Goerzen
-
Krasimir Angelov
-
Simon Marlow