
Hi there!
As some of you may know, I've been working on an aarch64 native code generator. Now I've hit a situation where my stage2 compiler somehow corrupts my heap.
Initially I thought this was likely due to missing memory barriers; however, they are emitted. This doesn't rule barriers out as the cause, but at least it's not as simple as "they are just missing".
The crashes I see are nondeterministic; in fact, I sometimes even manage to compile a Hello World module without crashes. Other times it crashes with unknown closure errors, or it just crashes. But it always crashes during GC. Changing the nursery size makes it crash a bit more frequently, but nothing obvious sticks out yet.
If anyone has some creative ideas, I'd love to hear them. I've been wondering if just logging allocations (offset, range, type) would help figuring out what we expected to be there; and then maybe try to break on the allocation (and subsequent writes).
I'm sure some have been down this road before.
Cheers,
Moritz
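The allocation-logging idea can be illustrated with a minimal sketch. It assumes each allocation is recorded as a (start, size, type) triple and that recorded ranges do not overlap; the names `AllocationLog`, `record`, and `lookup` are hypothetical and are not part of GHC or its RTS.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    start: int      # address (or heap offset) of the first byte
    size: int       # size in bytes
    type_name: str  # closure type as recorded at allocation time


class AllocationLog:
    """Log of allocations, searchable by address.

    Assumption: recorded ranges do not overlap. A real log would also
    need something like a GC epoch, because addresses are reused.
    """

    def __init__(self):
        self._starts = []  # start addresses, kept sorted
        self._allocs = []  # allocations, in the same order as _starts

    def record(self, start, size, type_name):
        # Insert in sorted position so that lookup can use binary search.
        i = bisect.bisect_right(self._starts, start)
        self._starts.insert(i, start)
        self._allocs.insert(i, Allocation(start, size, type_name))

    def lookup(self, addr):
        """Return the allocation whose range contains addr, or None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        alloc = self._allocs[i]
        return alloc if addr < alloc.start + alloc.size else None


log = AllocationLog()
log.record(0x1000, 24, "THUNK")
log.record(0x1018, 16, "CONSTR")
# Ask what was expected at an address where the GC found garbage.
print(log.lookup(0x1020))
```

Once the expected object at a corrupted address is known, its start address is a candidate for a breakpoint or watchpoint on the allocation and subsequent writes.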

On 31 Aug 2020, at 5:54 pm, Moritz Angermann wrote:
> If anyone has some creative ideas, I'd love to hear them. I've been wondering if just logging allocations (offset, range, type) would help figuring out what we expected to be there; and then maybe try to break on the allocation (and subsequent writes).
> I'm sure some have been down this road before.
Force a GC before every allocation, and make the GC check the validity of the objects before it moves anything. I think this used to be possible by compiling the runtime system in debug mode.
The usual pain of heap corruption is that once the heap is corrupted it may be several GC cycles before you get the actual crash, and in the meantime the objects have all been moved around. The GC walks over all the objects by nature, so get it to validate the heap every time it does, then force it to run as often as you possibly can.
A user space approach is to use a library like vacuum or packman that also walks over the heap objects directly.
http://hackage.haskell.org/package/vacuum-2.2.0.0/docs/GHC-Vacuum.html
https://hackage.haskell.org/package/packman
Ben.

Dump the whole heap into a file, either during GC traversal or by taking the
whole allocated area. Hmm, maybe this is the same as a core dump.
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

I assume you're familiar with the following from
https://www.aosabook.org/en/ghc.html and that this facility is still there.
Just in case you are not:
So, the debug RTS has an optional mode that we call *sanity checking*.
Sanity checking enables all kinds of expensive assertions, and can make the
program run many times more slowly. In particular, sanity checking runs a
full scan of the heap to check for dangling pointers (amongst other
things), before *and* after every GC. The first job when investigating a
runtime crash is to run the program with sanity checking turned on;
sometimes this will catch the invariant violation well before the program
actually crashes.
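To illustrate why scanning the heap after every step helps, here is a toy sketch. It is not the RTS implementation: the heap is modelled as a dictionary from object address to a list of pointer fields, and `dangling_pointers` and `first_bad_step` are hypothetical names. The point is only that a full scan after each mutation names the step that broke the invariant, instead of a crash several collections later.

```python
def dangling_pointers(heap):
    """Return (object address, field index, target) for every pointer
    field whose target is not the address of an object in the heap."""
    bad = []
    for addr in sorted(heap):
        for index, target in enumerate(heap[addr]):
            if target not in heap:
                bad.append((addr, index, target))
    return bad


def first_bad_step(heap, steps):
    """Apply each mutation in turn, scanning the whole heap after every one.

    Returns the index of the first step that left a dangling pointer,
    or None if the heap stayed valid throughout.
    """
    for n, step in enumerate(steps):
        step(heap)
        if dangling_pointers(heap):
            return n
    return None


def add_object(heap):
    heap[0x30] = [0x10]   # new object pointing at an existing one


def corrupt_field(heap):
    heap[0x10][0] = 0x28  # pointer into the middle of an object


def add_another(heap):
    heap[0x40] = []


toy_heap = {0x10: [0x20], 0x20: []}
print(first_bad_step(toy_heap, [add_object, corrupt_field, add_another]))
# prints 1: the scan flags the corrupting step itself
```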

+Moritz

Fuzzing:
1. generate simple random STG programs
2. compile and run with RTS sanity checking enabled
3. compare the program result between different backends
The fuzzer should cover all codegen cases and all code in the RTS. Maybe this
coverage could be checked by the existing tools.
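The three steps above can be sketched as a small differential-testing loop. This is a toy under stated assumptions: instead of STG programs it generates random arithmetic expressions, and instead of real compiler backends it compares a reference evaluator against the same evaluator with a deliberately injected bug (swapped subtraction operands). A real harness would presumably compile each generated program with two GHC backends and the debug RTS, then compare their output. All names here are hypothetical.

```python
import random


def random_expr(rng, depth):
    """Generate a small random arithmetic expression as a nested tuple."""
    if depth == 0 or rng.random() < 0.3:
        return ("lit", rng.randint(-8, 8))
    op = rng.choice(["add", "sub", "mul"])
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def evaluate(expr, swap_sub=False):
    """Evaluate an expression. With swap_sub=True, subtraction operands
    are swapped, standing in for a miscompiling backend."""
    tag = expr[0]
    if tag == "lit":
        return expr[1]
    left = evaluate(expr[1], swap_sub)
    right = evaluate(expr[2], swap_sub)
    if tag == "add":
        return left + right
    if tag == "mul":
        return left * right
    # Remaining case: tag == "sub"
    return right - left if swap_sub else left - right


def find_mismatch(backend_a, backend_b, seed, attempts=1000, depth=4):
    """Return the first generated expression on which the two backends
    disagree, or None if they agree on every attempt."""
    rng = random.Random(seed)
    for _ in range(attempts):
        expr = random_expr(rng, depth)
        if backend_a(expr) != backend_b(expr):
            return expr
    return None


def buggy(expr):
    return evaluate(expr, swap_sub=True)


print(find_mismatch(evaluate, buggy, seed=0))
```

Fixing the seed keeps a failing case reproducible, which matters when the crash being chased is itself nondeterministic.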

Thanks everyone. I have indeed been trying to get somewhere with sanity
checking. That used to help quite a bit for the dead-stripping stuff that
happened on iOS a long time ago, but that was also much more deterministic.
Maybe I'll try to see if running it through qemu will give me some more
determinism. That at least gives somewhat predictable allocations. It could
still end up being some annoying memory-ordering issue that the LLVM backend
just happens not to run into, either by luck or thanks to its optimisation
passes.

Ben Lippmeier wrote:
>> On 31 Aug 2020, at 5:54 pm, Moritz Angermann wrote:
>> If anyone has some creative ideas, I'd love to hear them. I've been wondering if just logging allocations (offset, range, type) would help figuring out what we expected to be there; and then maybe try to break on the allocation (and subsequent writes).
>> I'm sure some have been down this road before.
> Force a GC before every allocation, and make the GC check the validity of the objects before it moves anything. I think this used to be possible by compiling the runtime system in debug mode.
> The usual pain of heap corruption is that once the heap is corrupted it may be several GC cycles before you get the actual crash, and in the meantime the objects have all been moved around. The GC walks over all the objects by nature, so get it to validate the heap every time it does, then force it to run as often as you possibly can.
Indeed. Small nurseries (using +RTS -A), deterministic GC behavior (with +RTS -V0 -I0), and sanity checking (with +RTS -DS) are all very useful for this.
> A user space approach is to use a library like vacuum or packman that also walks over the heap objects directly.
> http://hackage.haskell.org/package/vacuum-2.2.0.0/docs/GHC-Vacuum.html
> https://hackage.haskell.org/package/packman
For what it's worth, the ghc-debug [1] project, which Sven Tennie, Matt Pickering, and I have been working on over the last year or so, was in part motivated by precisely this use-case. It would allow one Haskell process's heap to be traversed by another process. This is useful for both debugging and profiling use-cases.
Cheers,
- Ben
[1] https://github.com/bgamari/ghc-debug
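As a small sketch of how the RTS flags mentioned above might be combined when driving test runs from a script: the helper names and the 64k nursery size are arbitrary assumptions, and the sketch assumes the program was linked against the debug RTS (`-debug`) with RTS options enabled (`-rtsopts`), which `-DS` needs.

```python
def rts_debug_args(nursery_size="64k"):
    """RTS flags from the thread: small nursery (-A), no timer tick and
    no idle GC (-V0 -I0), and sanity checking (-DS)."""
    return ["+RTS", f"-A{nursery_size}", "-V0", "-I0", "-DS", "-RTS"]


def debug_command(program, program_args=(), nursery_size="64k"):
    """Build an argv list suitable for subprocess.run.

    Assumption: `program` was linked with -debug and -rtsopts.
    """
    return [program, *program_args, *rts_debug_args(nursery_size)]


# "./Hello" is a placeholder binary name.
print(" ".join(debug_command("./Hello")))
```

Sweeping `nursery_size` over several values is one way to vary how often the GC runs, and with it how soon the sanity check gets a chance to notice corruption.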
participants (5)
- Ben Gamari
- Ben Lippmeier
- Csaba Hruska
- George Colpitts
- Moritz Angermann