Memory management issue in notmuch-haskell bindings

It seems that the notmuch-haskell bindings (version 0.2.2 built against notmuch from git master; passes notmuch-test) aren't dealing with memory management properly. In particular, the attached test code[1] causes talloc to abort. Unfortunately, while the issue is consistently reproducible, it only occurs with some queries (see source[1]). I have been unable to establish the exact criterion for failure. It seems that the crash is caused by an invalid access to a freed Query object while freeing a Messages object (see Valgrind trace[3]). I've taken a brief look at the bindings themselves but, being only minimally familiar with the FFI, there's nothing obviously wrong (the finalizers passed to newForeignPtr look sane). I was under the impression that talloc was reference counted, so the Query object shouldn't have been freed unless if there was still a Messages object holding a reference. Any idea what might have gone wrong here? Thanks! Cheers, - Ben [1] Test case, import Data.List import Control.Monad import System.Environment import Foreign.Notmuch dbpath = "/home/ben/.mail" getAddresses :: Database -> String -> IO [String] getAddresses db q = do query <- queryCreate db q msgs <- queryMessages query addrs <- mapM (flip messageGetHeader $ "From") msgs return addrs main = do db <- databaseOpen dbpath DatabaseModeReadOnly --addrs2 <- getAddresses db "tag:haskell" -- This succeeds addrs3 <- getAddresses db "to:dietz" -- This fails --print addrs2 --print addrs3 databaseClose db [2] Crashed session and backtrace, [1217 ben@ben-laptop ~] $ ghc test.hs -auto-all -rtsopts -prof && gdb ./test GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /home/ben/test...(no debugging symbols found)...done. (gdb) run Starting program: /home/ben/test [Thread debugging using libthread_db enabled] Program received signal SIGABRT, Aborted. 0x00007ffff5979d05 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) bt #0 0x00007ffff5979d05 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007ffff597dab6 in abort () at abort.c:92 #2 0x00007ffff6de038c in talloc_abort (reason=0x7ffff6de56e8 "Bad talloc magic value - access after free") at ../talloc.c:210 #3 0x00007ffff6de0271 in talloc_abort_access_after_free (ptr=0x769190, location=0x7ffff7bd4e04 "lib/messages.c:142") at ../talloc.c:229 #4 talloc_chunk_from_ptr (ptr=0x769190, location=0x7ffff7bd4e04 "lib/messages.c:142") at ../talloc.c:250 #5 _talloc_free (ptr=0x769190, location=0x7ffff7bd4e04 "lib/messages.c:142") at ../talloc.c:1164 #6 0x00007ffff7bc7e65 in notmuch_messages_destroy (messages=0x769190) at lib/messages.c:142 #7 0x00000000004de1c9 in scheduleFinalizers () #8 0x00000000004e013d in GarbageCollect () #9 0x00000000004d9e40 in scheduleDoGC.clone.18 () #10 0x00000000004db0e0 in exitScheduler () #11 0x00000000004d9066 in hs_exit_ () #12 0x00000000004d940a in shutdownHaskellAndExit () #13 0x00000000004d8a91 in real_main () #14 0x00000000004d8ade in hs_main () #15 0x00007ffff5964eff in __libc_start_main (main=0x408ed0 <main>, argc=1, ubp_av=0x7fffffffe4f8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe4e8) at libc-start.c:226 #16 0x0000000000407791 in _start () (gdb) [3] Valgrind output, ==25241== Memcheck, a memory error detector ==25241== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al. ==25241== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info ==25241== Command: ./test ==25241== ==25241== Conditional jump or move depends on uninitialised value(s) ==25241== at 0x52BB510: inflateReset2 (in /lib/x86_64-linux-gnu/libz.so.1.2.3.4) ==25241== by 0x52BB605: inflateInit2_ (in /lib/x86_64-linux-gnu/libz.so.1.2.3.4) ==25241== by 0x5F211BE: ChertTable::lazy_alloc_inflate_zstream() const (chert_table.cc:1672) ==25241== by 0x5F23B06: ChertTable::read_tag(Cursor*, std::string*, bool) const (chert_table.cc:1264) ==25241== by 0x5F260F9: ChertTable::get_exact_entry(std::string const&, std::string&) const (chert_table.cc:1210) ==25241== by 0x5F26DE2: ChertTermList::ChertTermList(Xapian::Internal::RefCntPtr<ChertDatabase const>, unsigned int) (chert_termlist.cc:44) ==25241== by 0x5EFF2E5: ChertDatabase::open_term_list(unsigned int) const (chert_database.cc:891) ==25241== by 0x5E7E7FB: Xapian::Document::termlist_begin() const (omdocument.cc:176) ==25241== by 0x4E41822: _notmuch_message_ensure_metadata(_notmuch_message*) (message.cc:309) ==25241== by 0x4E429DC: _notmuch_message_ensure_filename_list(_notmuch_message*) (message.cc:632) ==25241== by 0x4E42C37: notmuch_message_get_filename (message.cc:698) ==25241== by 0x4E41CA7: _notmuch_message_ensure_message_file(_notmuch_message*) (message.cc:403) ==25241== ==25241== Invalid read of size 4 ==25241== at 0x5C25DED: _talloc_free (talloc.c:242) ==25241== by 0x4E39E64: notmuch_messages_destroy (messages.c:142) ==25241== by 0x4DE1C8: scheduleFinalizers (in /home/ben/test) ==25241== by 0x4E013C: GarbageCollect (in /home/ben/test) ==25241== by 0x4D9E3F: scheduleDoGC.clone.18 (in /home/ben/test) ==25241== by 0x4DB0DF: exitScheduler (in /home/ben/test) ==25241== by 0x4D9065: hs_exit_ (in /home/ben/test) ==25241== by 0x4D9409: shutdownHaskellAndExit (in /home/ben/test) ==25241== by 0x4D8A90: real_main (in /home/ben/test) ==25241== by 0x4D8ADD: hs_main (in /home/ben/test) ==25241== by 0x6F4FEFE: (below main) (libc-start.c:226) ==25241== Address 0x8324e90 is 64 bytes inside a block of size 136 free'd ==25241== at 0x4C282ED: free (vg_replace_malloc.c:366) ==25241== by 0x5C29884: _talloc_free_internal (talloc.c:714) ==25241== by 0x5C25F22: _talloc_free (talloc.c:667) ==25241== by 0x4E452BF: notmuch_query_destroy (query.cc:358) ==25241== by 0x4DE1C8: scheduleFinalizers (in /home/ben/test) ==25241== by 0x4E013C: GarbageCollect (in /home/ben/test) ==25241== by 0x4D9E3F: scheduleDoGC.clone.18 (in /home/ben/test) ==25241== by 0x4DB0DF: exitScheduler (in /home/ben/test) ==25241== by 0x4D9065: hs_exit_ (in /home/ben/test) ==25241== by 0x4D9409: shutdownHaskellAndExit (in /home/ben/test) ==25241== by 0x4D8A90: real_main (in /home/ben/test) ==25241== by 0x4D8ADD: hs_main (in /home/ben/test) ==25241== ==25241== Invalid read of size 8 ==25241== at 0x5C26251: _talloc_free (talloc.c:249) ==25241== by 0x4E39E64: notmuch_messages_destroy (messages.c:142) ==25241== by 0x4DE1C8: scheduleFinalizers (in /home/ben/test) ==25241== by 0x4E013C: GarbageCollect (in /home/ben/test) ==25241== by 0x4D9E3F: scheduleDoGC.clone.18 (in /home/ben/test) ==25241== by 0x4DB0DF: exitScheduler (in /home/ben/test) ==25241== by 0x4D9065: hs_exit_ (in /home/ben/test) ==25241== by 0x4D9409: shutdownHaskellAndExit (in /home/ben/test) ==25241== by 0x4D8A90: real_main (in /home/ben/test) ==25241== by 0x4D8ADD: hs_main (in /home/ben/test) ==25241== by 0x6F4FEFE: (below main) (libc-start.c:226) ==25241== Address 0x8324e80 is 48 bytes inside a block of size 136 free'd ==25241== at 0x4C282ED: free (vg_replace_malloc.c:366) ==25241== by 0x5C29884: _talloc_free_internal (talloc.c:714) ==25241== by 0x5C25F22: _talloc_free (talloc.c:667) ==25241== by 0x4E452BF: notmuch_query_destroy (query.cc:358) ==25241== by 0x4DE1C8: scheduleFinalizers (in /home/ben/test) ==25241== by 0x4E013C: GarbageCollect (in /home/ben/test) ==25241== by 0x4D9E3F: scheduleDoGC.clone.18 (in /home/ben/test) ==25241== by 0x4DB0DF: exitScheduler (in /home/ben/test) ==25241== by 0x4D9065: hs_exit_ (in /home/ben/test) ==25241== by 0x4D9409: shutdownHaskellAndExit (in /home/ben/test) ==25241== by 0x4D8A90: real_main (in /home/ben/test) ==25241== by 0x4D8ADD: hs_main (in /home/ben/test) ==25241== ==25241== ==25241== HEAP SUMMARY: ==25241== in use at exit: 192,328 bytes in 555 blocks ==25241== total heap usage: 19,852 allocs, 19,297 frees, 35,375,726 bytes allocated ==25241== ==25241== LEAK SUMMARY: ==25241== definitely lost: 0 bytes in 0 blocks ==25241== indirectly lost: 0 bytes in 0 blocks ==25241== possibly lost: 27,799 bytes in 292 blocks ==25241== still reachable: 164,529 bytes in 263 blocks ==25241== suppressed: 0 bytes in 0 blocks ==25241== Rerun with --leak-check=full to see details of leaked memory ==25241== ==25241== For counts of detected and suppressed errors, rerun with: -v ==25241== Use --track-origins=yes to see where uninitialised values come from ==25241== ERROR SUMMARY: 4 errors from 3 contexts (suppressed: 4 from 4)

On Tue, 16 Aug 2011 12:32:13 -0400, Ben Gamari
It seems that the notmuch-haskell bindings (version 0.2.2 built against notmuch from git master; passes notmuch-test) aren't dealing with memory management properly. In particular, the attached test code[1] causes talloc to abort. Unfortunately, while the issue is consistently reproducible, it only occurs with some queries (see source[1]). I have been unable to establish the exact criterion for failure.
It seems that the crash is caused by an invalid access to a freed Query object while freeing a Messages object (see Valgrind trace[3]). I've taken a brief look at the bindings themselves but, being only minimally familiar with the FFI, there's nothing obviously wrong (the finalizers passed to newForeignPtr look sane). I was under the impression that talloc was reference counted, so the Query object shouldn't have been freed unless if there was still a Messages object holding a reference. Any idea what might have gone wrong here? Thanks!
After looking into this issue in a bit more depth, I'm even more confused. In fact, I would not be surprised if I have stumbled into a bug in the GC. It seems that the notmuch-haskell bindings follow the example of the python bindings in that child objects keep references to their parents to prevent the garbage collector from releasing the parent, which would in turn cause talloc to free the child objects, resulting in odd behavior when the child objects were next accessed. For instance, the Query and Messages objects are defined as follows, type MessagesPtr = ForeignPtr S__notmuch_messages type MessagePtr = ForeignPtr S__notmuch_message newtype Query = Query (ForeignPtr S__notmuch_query) data MessagesRef = QueryMessages { qmpp :: Query, msp :: MessagesPtr } | ThreadMessages { tmpp :: Thread, msp :: MessagesPtr } | MessageMessages { mmspp :: Message, msp :: MessagesPtr } data Message = MessagesMessage { msmpp :: MessagesRef, mp :: MessagePtr } | Message { mp :: MessagePtr } type Messages = [Message] As seen in the Valgrind dump given in my previous message, it seems that the Query object is being freed before the Messages object. Since the Messages object is a child of the Query object, this fails. In my case, I'm calling queryMessages which begins by issuing a given notmuch Query, resulting in a MessagesPtr. This is then packaged into a QueryMessages object which is then passed off to unpackMessages. unpackMessages iterates over this collection, creating MessagesMessage objects which themselves refer to the QueryMessages object. Finally, these MessagesMessage objects are packed into a list, resulting in a Messages object. Thus we have the following chain of references, MessagesMessage | | msmpp \/ QueryMessages | | qmpp \/ Query As we can see, each MessagesMessage object in the Messages list resulting from queryMessages holds a reference to the Query object from which it originated. For this reason, I fail to see how it is possible that the RTS would attempt to free the Query before freeing the MessagesPtr. Did I miss something in my analysis? Are there tools for debugging issues such as this? Perhaps this is a bug in the GC? Any help at all would be greatly appreciated. Cheers, - Ben

On Sun, Aug 28, 2011 at 4:27 PM, Ben Gamari
On Tue, 16 Aug 2011 12:32:13 -0400, Ben Gamari
wrote: It seems that the notmuch-haskell bindings (version 0.2.2 built against notmuch from git master; passes notmuch-test) aren't dealing with memory management properly. In particular, the attached test code[1] causes talloc to abort. Unfortunately, while the issue is consistently reproducible, it only occurs with some queries (see source[1]). I have been unable to establish the exact criterion for failure.
It seems that the crash is caused by an invalid access to a freed Query object while freeing a Messages object (see Valgrind trace[3]). I've taken a brief look at the bindings themselves but, being only minimally familiar with the FFI, there's nothing obviously wrong (the finalizers passed to newForeignPtr look sane). I was under the impression that talloc was reference counted, so the Query object shouldn't have been freed unless if there was still a Messages object holding a reference. Any idea what might have gone wrong here? Thanks!
After looking into this issue in a bit more depth, I'm even more confused. In fact, I would not be surprised if I have stumbled into a bug in the GC. It seems that the notmuch-haskell bindings follow the example of the python bindings in that child objects keep references to their parents to prevent the garbage collector from releasing the parent, which would in turn cause talloc to free the child objects, resulting in odd behavior when the child objects were next accessed. For instance, the Query and Messages objects are defined as follows,
type MessagesPtr = ForeignPtr S__notmuch_messages type MessagePtr = ForeignPtr S__notmuch_message newtype Query = Query (ForeignPtr S__notmuch_query) data MessagesRef = QueryMessages { qmpp :: Query, msp :: MessagesPtr } | ThreadMessages { tmpp :: Thread, msp :: MessagesPtr } | MessageMessages { mmspp :: Message, msp :: MessagesPtr } data Message = MessagesMessage { msmpp :: MessagesRef, mp :: MessagePtr } | Message { mp :: MessagePtr } type Messages = [Message]
One problem you might be running in to is that the optimization passes can notice that a function isn't using all of its arguments, and then it won't pass them. These even applies if the arguments are bound together in a record type. So if you have a record type:
data QueryResult = QR {qrQueryPtr :: ForeignPtr (), qrResultPointer :: Ptr ()}
and a function:
processQueryResult :: QueryResult -> IO (...)
If the function doesn't use the 'qrQueryPointer' part of the record, the compiler may not even pass it in. This might run the finalizer for the foreign pointer earlier than you expect. If the result pointer is a part of the query foreign pointer, you're in trouble. I'm not sure if this is what's happening, but it sounds like it could be. If this is the case you might want to build some helper functions using the function 'touchForeignPtr', which does nothing other than make it look like the foreign pointer is still in use. In my example it might be something like:
withQueryResultPtr :: QueryResult -> (Ptr QueryResult -> IO a) -> IO a withQueryResultPtr qr k = do x <- k (qrQueryPtr qr) touchForeignPtr (qrResultPointer qr) return x
Antoine

On Sun, 28 Aug 2011 22:26:05 -0500, Antoine Latter
One problem you might be running in to is that the optimization passes can notice that a function isn't using all of its arguments, and then it won't pass them. These even applies if the arguments are bound together in a record type.
In this case I wouldn't be able to reproduce the problem with optimization disabled, no? Unfortunately, this is not the case; the problem persists even with -O0. - Ben

On Sun, Aug 28, 2011 at 10:47 PM, Ben Gamari
On Sun, 28 Aug 2011 22:26:05 -0500, Antoine Latter
wrote: One problem you might be running in to is that the optimization passes can notice that a function isn't using all of its arguments, and then it won't pass them. These even applies if the arguments are bound together in a record type.
In this case I wouldn't be able to reproduce the problem with optimization disabled, no? Unfortunately, this is not the case; the problem persists even with -O0.
Perhaps? I don't know the details about how the GC decides when something is reachable. The scenario I described (which sounds similar to yours?) is only safe in Haskell when using functions like touchForeignPtr. Antoine
participants (2)
-
Antoine Latter
-
Ben Gamari