libefence useful for debugging ghc+ffi programs?

Hi. I have an application compiled with ghc and using C via FFI (and lots of Haskell libraries as well) and it keeps annoying me with segfaults and (worse) unpredictable behaviour. I guess a C programmer would now apply electric-fence or valgrind. Is this a reasonable approach for ghc+ffi as well? E.g., this:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x4114f950 (LWP 8343)]
    0x00007f5af4e273b5 in free () from /usr/lib/libefence.so.0
    (gdb) where
    #0 0x00007f5af4e273b5 in free () from /usr/lib/libefence.so.0
    #1 0x00007f5af4059519 in ?? () from /usr/lib/libcurl.so.4
    #2 0x00007f5af4059748 in ?? () from /usr/lib/libcurl.so.4
    #3 0x00007f5af4059abd in ?? () from /usr/lib/libcurl.so.4
    #4 0x00007f5af405f080 in ?? () from /usr/lib/libcurl.so.4
    #5 0x00000000007bef95 in ?? ()
    #6 0x00007f5aefb66628 in ?? ()
    #7 0x00007f5af0df9f00 in ?? ()
    #8 0x0000000000000000 in ?? ()

Is this a reason to distrust libcurl (the C libs) or curl (the Haskell package)?

Thanks - J.W.

Some more info on this:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x42773950 (LWP 29449)]
    0x00007f717c70e370 in free () from /usr/lib/libefence.so.0
    (gdb) where
    #0 0x00007f717c70e370 in free () from /usr/lib/libefence.so.0
    #1 0x00007f717b931ee9 in conn_free () from /usr/local/lib/libcurl.so.4
    #2 0x00007f717b9326fd in Curl_disconnect () from /usr/local/lib/libcurl.so.4
    #3 0x00007f717b932865 in ConnectionKillOne () from /usr/local/lib/libcurl.so.4
    #4 0x00007f717b9348a8 in Curl_close () from /usr/local/lib/libcurl.so.4
    #5 0x0000000004000001 in ?? ()

Is curl OK with the threaded runtime? I realize I might have one curl connection active via hxt, and another one because I call it directly.

J.W.

That looks a lot like a double free, for what it's worth. Do the errors go away if you turn off threading? Edward

That looks a lot like a double free [...]
there's definitely something about initializing libcurl: http://curl.haxx.se/libcurl/c/curl_easy_init.html uses nice phrases like "may be lethal in multi-threading". The documentation of Haskell curl, http://hackage.haskell.org/packages/archive/curl/1.3.5/doc/html/Network-Curl.html, just says "withCurlDo should be called once", while in fact it should be much stronger: "must be called exactly once"?

Anyway, I temporarily dropped curl (replaced by system "wget") and the erratic behaviour persists. Now it looks like this:

    Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
    [New Thread 0x7fc37b70f6e0 (LWP 7281)]
    [New Thread 0x4122a950 (LWP 7284)]
    [New Thread 0x4214a950 (LWP 7285)]
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fc37b70f6e0 (LWP 7281)]
    0x00007fc37b715aae in memalign () from /usr/lib/libefence.so.0
    (gdb) where
    #0 0x00007fc37b715aae in memalign () from /usr/lib/libefence.so.0
    #1 0x00007fc37b715c97 in malloc () from /usr/lib/libefence.so.0
    #2 0x00007fc37a492a55 in gethostbyname () from /lib/libc.so.6
    #3 0x000000000089cc88 in networkzm2zi2zi1zi7_NetworkziBSD_zdwlvl_info ()
    #4 0x0000000000000000 in ?? ()

and indeed, gethostbyname is famous for being non re-entrant. (packages I use are HTTP, hxt, happstack-server; and I can't drop all of them ...)

Excerpts from Johannes Waldmann's message of Wed Oct 20 05:13:36 -0400 2010:
and indeed, gethostbyname is famous for being non re-entrant.
If you have the time, this would be a great time to improve the multithreaded support of these libraries. In particular, glibc offers a re-entrant version gethostbyname_r, so at least for some POSIX systems network could be switched to using that. If all else fails, perhaps manually synchronize over an MVar and hope no one else imports the FFI. If you don't have the time, if you can identify where gethostbyname is getting called from, hopefully you can manually synchronize those sections of the program. Edward
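A minimal sketch of the "manually synchronize over an MVar" idea, using only base. The names ffiLock, withFfiLock and unsafeBump are hypothetical; unsafeBump is merely a stand-in for a non-reentrant C call such as gethostbyname (a non-atomic read-modify-write on shared state), not the real FFI import.

```haskell
import Control.Concurrent (forkIO, yield)
import Control.Concurrent.MVar
import Control.Monad (forM_, replicateM_)
import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

-- One process-wide lock. NOINLINE keeps GHC from duplicating the
-- unsafePerformIO call, which would create more than one MVar.
{-# NOINLINE ffiLock #-}
ffiLock :: MVar ()
ffiLock = unsafePerformIO (newMVar ())

-- Run an action while holding the lock; withMVar releases it on exceptions too.
withFfiLock :: IO a -> IO a
withFfiLock act = withMVar ffiLock (\_ -> act)

-- Stand-in (assumption) for a non-reentrant foreign call:
-- read shared state, let other threads run, then write it back.
unsafeBump :: IORef Int -> IO ()
unsafeBump ref = do
  n <- readIORef ref
  yield
  writeIORef ref (n + 1)

main :: IO ()
main = do
  ref  <- newIORef 0
  done <- newEmptyMVar
  -- Four threads, each making 1000 calls through the lock.
  forM_ [1 .. 4 :: Int] $ \_ -> forkIO $ do
    replicateM_ 1000 (withFfiLock (unsafeBump ref))
    putMVar done ()
  replicateM_ 4 (takeMVar done)
  readIORef ref >>= print  -- prints 4000: no update is lost under the lock
```

The lock only protects callers that go through withFfiLock. Any other module that imports the same C function and calls it directly bypasses the lock entirely, which is the caveat about hoping no one else imports the FFI.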

and indeed, gethostbyname is famous for being non re-entrant.
it already has a lock in Network.BSD, so I assume it's fine:

    {-# NOINLINE lock #-}
    lock :: MVar ()
    lock = unsafePerformIO $ newMVar ()

    withLock :: IO a -> IO a
    withLock act = withMVar lock (\_ -> act)

    getHostByName :: HostName -> IO HostEntry
    getHostByName name = withLock $ do
      withCString name $ \ name_cstr -> do
        ent <- throwNoSuchThingIfNull "getHostByName" "no such host entry"
             $ trySysCall $ c_gethostbyname name_cstr
        peek ent

Hmm, in that case, one possibility is someone else did an FFI import of gethostbyname and isn't using the same lock. Can you check for that?

Edward

Excerpts from Johannes Waldmann's message of Wed Oct 20 16:17:06 -0400 2010:
it already has a lock in Network.BSD, so I assume it's fine: [...]

OK, never mind, I found the problem in my C code: some uninitialized variables. Mostly they were 0, but sometimes not; I guess that happened when mallocForeignPtrBytes handed me memory that had just been freed by the garbage collector. Although the program does a ton of allocations, most start with a memcpy of something that the program computed earlier, except for a handful of root nodes. But these were allocated early, when there was little garbage, so I got most of them in their zeroed-out initial state, and that's why the error did not show. (At least that's my guess.)

This restores my faith in ghc, ffi, and library writers - and shows how much I unlearned C programming (which I guess is generally a good thing - except when you program C).

J.W.
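One way to rule out this class of bug on the Haskell side is to zero the buffer right after allocating it, since, as far as I can tell, mallocForeignPtrBytes makes no promise about the initial contents. The sketch below uses only base; the helper name mallocZeroedForeignPtrBytes is hypothetical and is not part of any library mentioned in this thread.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Marshal.Utils (fillBytes)
import Foreign.Storable (peekByteOff)

-- Hypothetical helper: allocate n bytes and overwrite them with zeros
-- before anyone (Haskell or C) gets to read them.
mallocZeroedForeignPtrBytes :: Int -> IO (ForeignPtr Word8)
mallocZeroedForeignPtrBytes n = do
  fp <- mallocForeignPtrBytes n
  withForeignPtr fp $ \p -> fillBytes p 0 n
  return fp

main :: IO ()
main = do
  fp <- mallocZeroedForeignPtrBytes 64
  -- Read every byte back to check that the buffer really is zeroed.
  bytes <- withForeignPtr fp $ \p ->
    mapM (\i -> peekByteOff p i :: IO Word8) [0 .. 63]
  print (all (== 0) bytes)  -- prints True
```

Zeroing costs one extra pass over the buffer, so for allocations that are immediately overwritten by a memcpy it is redundant; it matters for the few buffers, like the root nodes above, that the C code reads before writing.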
participants (2): Edward Z. Yang, Johannes Waldmann