Bug in GC's ordering of ForeignPtr finalization?

On Tue, 16 Aug 2011 12:32:13 -0400, Ben Gamari wrote:
It seems that the notmuch-haskell bindings (version 0.2.2 built against notmuch from git master; passes notmuch-test) aren't dealing with memory management properly. In particular, the attached test code[1] causes talloc to abort. Unfortunately, while the issue is consistently reproducible, it only occurs with some queries (see source[1]). I have been unable to establish the exact criterion for failure.
It seems that the crash is caused by an invalid access to a freed Query object while freeing a Messages object (see Valgrind trace[3]). I've taken a brief look at the bindings themselves but, being only minimally familiar with the FFI, I see nothing obviously wrong (the finalizers passed to newForeignPtr look sane). I was under the impression that talloc was reference counted, so the Query object shouldn't have been freed while there was still a Messages object holding a reference. Any idea what might have gone wrong here? Thanks!
After looking into this issue in a bit more depth, I'm even more confused. In fact, I would not be surprised if I have stumbled into a bug in the GC. It seems that the notmuch-haskell bindings follow the example of the python bindings in that child objects keep references to their parents to prevent the garbage collector from releasing the parent, which would in turn cause talloc to free the child objects, resulting in odd behavior when the child objects were next accessed. For instance, the Query and Messages objects are defined as follows,

    type MessagesPtr = ForeignPtr S__notmuch_messages
    type MessagePtr = ForeignPtr S__notmuch_message
    newtype Query = Query (ForeignPtr S__notmuch_query)
    data MessagesRef = QueryMessages { qmpp :: Query, msp :: MessagesPtr }
                     | ThreadMessages { tmpp :: Thread, msp :: MessagesPtr }
                     | MessageMessages { mmspp :: Message, msp :: MessagesPtr }
    data Message = MessagesMessage { msmpp :: MessagesRef, mp :: MessagePtr }
                 | Message { mp :: MessagePtr }
    type Messages = [Message]

As seen in the Valgrind dump given in my previous message, it seems that the Query object is being freed before the Messages object. Since the Messages object is a child of the Query object, this fails.

In my case, I'm calling queryMessages which begins by issuing a given notmuch Query, resulting in a MessagesPtr. This is then packaged into a QueryMessages object which is then passed off to unpackMessages. unpackMessages iterates over this collection, creating MessagesMessage objects which themselves refer to the QueryMessages object. Finally, these MessagesMessage objects are packed into a list, resulting in a Messages object. Thus we have the following chain of references,

    MessagesMessage
          |
          |  msmpp
          \/
    QueryMessages
          |
          |  qmpp
          \/
        Query

As we can see, each MessagesMessage object in the Messages list resulting from queryMessages holds a reference to the Query object from which it originated.
For this reason, I fail to see how it is possible that the RTS would attempt to free the Query before freeing the MessagesPtr. Did I miss something in my analysis? Are there tools for debugging issues such as this? Perhaps this is a bug in the GC? Any help at all would be greatly appreciated. Cheers, - Ben

On Sun, Aug 28, 2011 at 4:27 PM, Ben Gamari wrote:
After looking into this issue in a bit more depth, I'm even more confused. In fact, I would not be surprised if I have stumbled into a bug in the GC. It seems that the notmuch-haskell bindings follow the example of the python bindings in that child objects keep references to their parents to prevent the garbage collector from releasing the parent, which would in turn cause talloc to free the child objects, resulting in odd behavior when the child objects were next accessed. For instance, the Query and Messages objects are defined as follows,
    type MessagesPtr = ForeignPtr S__notmuch_messages
    type MessagePtr = ForeignPtr S__notmuch_message
    newtype Query = Query (ForeignPtr S__notmuch_query)
    data MessagesRef = QueryMessages { qmpp :: Query, msp :: MessagesPtr }
                     | ThreadMessages { tmpp :: Thread, msp :: MessagesPtr }
                     | MessageMessages { mmspp :: Message, msp :: MessagesPtr }
    data Message = MessagesMessage { msmpp :: MessagesRef, mp :: MessagePtr }
                 | Message { mp :: MessagePtr }
    type Messages = [Message]
One problem you might be running into is that the optimization passes can notice that a function isn't using all of its arguments, and then it won't pass them. This even applies if the arguments are bound together in a record type. So if you have a record type:
data QueryResult = QR {qrQueryPtr :: ForeignPtr (), qrResultPointer :: Ptr ()}
and a function:
processQueryResult :: QueryResult -> IO (...)
If the function doesn't use the 'qrQueryPtr' part of the record, the compiler may not even pass it in. This might run the finalizer for the foreign pointer earlier than you expect. If the result pointer is a part of the query foreign pointer, you're in trouble. I'm not sure if this is what's happening, but it sounds like it could be. If this is the case you might want to build some helper functions using the function 'touchForeignPtr', which does nothing other than make it look like the foreign pointer is still in use. In my example it might be something like:
    withQueryResultPtr :: QueryResult -> (Ptr () -> IO a) -> IO a
    withQueryResultPtr qr k = do
        x <- k (qrResultPointer qr)       -- run the action on the raw result pointer
        touchForeignPtr (qrQueryPtr qr)   -- keep the owning query alive until here
        return x
Antoine
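
For anyone who wants to try the touchForeignPtr pattern in isolation, here is a minimal, self-contained sketch. It is not taken from the notmuch bindings: the 16-byte buffer, the Word8 element type, and the names newDemoResult and readDemo are assumptions made purely for illustration. The "result" pointer simply aliases memory owned by the "query" ForeignPtr, which is the situation described above.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr
  (ForeignPtr, mallocForeignPtrBytes, touchForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek, poke)

-- Hypothetical stand-in for a query/result pair: the result pointer
-- points into memory that is owned by the query's ForeignPtr.
data QueryResult = QR
  { qrQueryPtr      :: ForeignPtr Word8  -- owns the memory
  , qrResultPointer :: Ptr Word8         -- borrowed pointer into that memory
  }

-- Run an action on the raw result pointer and keep the owning
-- ForeignPtr alive until the action has finished.
withQueryResultPtr :: QueryResult -> (Ptr Word8 -> IO a) -> IO a
withQueryResultPtr qr k = do
  x <- k (qrResultPointer qr)
  touchForeignPtr (qrQueryPtr qr)
  return x

-- Build a QueryResult whose result pointer aliases byte 4 of a
-- freshly allocated 16-byte buffer (illustrative values only).
newDemoResult :: IO QueryResult
newDemoResult = do
  fp <- mallocForeignPtrBytes 16
  p <- withForeignPtr fp $ \base -> do
    poke (base `plusPtr` 4) (42 :: Word8)
    return (base `plusPtr` 4)
  return (QR fp p)

-- Read the byte back through the borrowed pointer.
readDemo :: IO Word8
readDemo = do
  qr <- newDemoResult
  withQueryResultPtr qr peek
```

Note that this helper is not exception safe: if the action throws, touchForeignPtr is never reached, although the owner is then no longer needed by the action either.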

On Sun, 28 Aug 2011 22:26:05 -0500, Antoine Latter wrote:
One problem you might be running into is that the optimization passes can notice that a function isn't using all of its arguments, and then it won't pass them. This even applies if the arguments are bound together in a record type.
In this case I wouldn't be able to reproduce the problem with optimization disabled, no? Unfortunately, this is not the case; the problem persists even with -O0. - Ben

On Sun, Aug 28, 2011 at 10:47 PM, Ben Gamari wrote:
In this case I wouldn't be able to reproduce the problem with optimization disabled, no? Unfortunately, this is not the case; the problem persists even with -O0.
Perhaps? I don't know the details about how the GC decides when something is reachable. The scenario I described (which sounds similar to yours?) is only safe in Haskell when using functions like touchForeignPtr. Antoine

Dear Ben, Ben Gamari wrote:
After looking into this issue in a bit more depth, I'm even more confused. In fact, I would not be surprised if I have stumbled into a bug in the GC. [...]

    MessagesMessage
          |
          |  msmpp
          \/
    QueryMessages
          |
          |  qmpp
          \/
        Query

As we can see, each MessagesMessage object in the Messages list resulting from queryMessages holds a reference to the Query object from which it originated. For this reason, I fail to see how it is possible that the RTS would attempt to free the Query before freeing the MessagesPtr.
When a garbage collection is performed, the RTS determines which heap objects are still reachable. The rest are then freed _simultaneously_, and the corresponding finalizers are run in no particular order. So assuming the application holds a reference to the MessagesMessage object for a while and then drops it, the GC will detect unreachability of all three objects at the same time, and in the end the finalizer for Query may be run before that of MessagesMessage. So I think this is not a bug.

To solve this problem properly, libnotmuch should stop imposing order constraints on when objects are freed - this would mean tracking references using talloc_reference and talloc_unlink instead of talloc_free inside the library. For a bindings author who does not want to touch the library, the best idea I have is to add a layer with the sole purpose of tracking those implicit references.

Best regards,
Bertram
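
A rough sketch of what such a tracking layer could look like on the Haskell side is given below. It is only an illustration under stated assumptions, not the notmuch-haskell implementation: Owner, retain, release, newParent, newChild and demo are hypothetical names, and the "destructors" are plain IO actions standing in for calls into the C library. The idea is that every finalizer merely drops a counted reference, and the parent's real destructor runs only once the parent handle and all child handles have been finalized, whatever order the RTS picks.

```haskell
import Control.Monad (when)
import Data.IORef
  (IORef, atomicModifyIORef', modifyIORef, newIORef, readIORef)
import Foreign.Concurrent (newForeignPtr)
import Foreign.ForeignPtr (ForeignPtr, finalizeForeignPtr)
import Foreign.Ptr (Ptr, nullPtr)

-- Shared bookkeeping for one C-side parent object (hypothetical layer).
data Owner = Owner
  { ownerRefs :: IORef Int  -- live Haskell handles depending on the parent
  , ownerFree :: IO ()      -- the real destructor for the parent
  }

-- Take one more reference on the parent.
retain :: Owner -> IO ()
retain (Owner refs _) = atomicModifyIORef' refs (\c -> (c + 1, ()))

-- Drop one reference; run the real destructor only when none remain.
release :: Owner -> IO ()
release (Owner refs free) = do
  n <- atomicModifyIORef' refs (\c -> (c - 1, c - 1))
  when (n == 0) free

-- Wrap a parent pointer. Its finalizer only drops the handle's own reference.
newParent :: Ptr a -> IO () -> IO (Owner, ForeignPtr a)
newParent p free = do
  refs <- newIORef 1
  let owner = Owner refs free
  fp <- newForeignPtr p (release owner)
  return (owner, fp)

-- Wrap a child pointer. Its finalizer frees the child first and then
-- drops the reference it held on the parent.
newChild :: Owner -> Ptr b -> IO () -> IO (ForeignPtr b)
newChild owner p freeChild = do
  retain owner
  newForeignPtr p (freeChild >> release owner)

-- Finalize the parent handle *before* the child handle and record the
-- order in which the stand-in destructors actually run.
demo :: IO [String]
demo = do
  logRef <- newIORef []
  let say msg = modifyIORef logRef (msg :)
  (owner, parentFp) <- newParent nullPtr (say "parent freed")
  childFp <- newChild owner nullPtr (say "child freed")
  finalizeForeignPtr parentFp
  finalizeForeignPtr childFp
  reverse <$> readIORef logRef
```

In demo the parent handle is finalized first on purpose, yet the recorded order is "child freed" followed by "parent freed", because the parent's destructor is deferred until the child has released its reference. Foreign.Concurrent finalizers may run on another thread, which is why the counter is updated with atomicModifyIORef'.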
Participants (3): Antoine Latter, Ben Gamari, Bertram Felgenhauer