Re: [Msys2-users] Debugging undeterministic segfaults

On 4. 11. 2014 1:30, Ray Donnelly wrote:
> Finally, can anyone else confirm the problem?
I'm sorry if I missed it, but I can't find what source version you're using, Gintautas. Release/trunk?
-- David Macek

I'm working on ghc trunk.
You were indeed right, the compiler was probably optimizing out my code.
The suggested crasher code works, and qtcreator gets invoked, although I
did not manage to set up gdb yet.
I think I have an idea of what's going wrong here. hvr@ was right in
pointing out that we need to be careful with the PATH. It seems that the
bundled gcc is picking up the system-wide DLLs, and bad things happen
because of version incompatibilities. That does not explain why "rm" is
crashing, but maybe that's fallout from cross-process damage. I will do
some more testing. If this is indeed the cause, then hopefully debugging
will not be needed anyway.
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs
-- Gintautas Miliauskas

On 4. 11. 2014 14:14, Gintautas Miliauskas wrote:
> I'm working on ghc trunk.
I'm trying to reproduce your errors, but I failed at ./boot with:
Booting .
'autoreconf' is not recognized as an internal or external command,
operable program or batch file.
Running autoreconf failed with exitcode 256 at ./boot line 163, <PKGS> line 12.
It seems that /mingw64/bin/perl's system("autoreconf") fails to execute because it's passing the command line to cmd, not bash (/usr/bin/autoreconf is a script).
Gintautas, do you have mingw-w64-x86_64-perl installed?
Can we do something about this, or is boot going to work only in pure msys2 shell?
-- David Macek

I'm using /usr/bin/perl, and don't have the mingw perl installed.
-- Gintautas Miliauskas

Hi.
I just built GHC from master (1c35f9f1cb7a293da85d649904ce731a65824cfe) in my somewhat outdated MSYS2. I followed the wiki page with a few exceptions:
- I cleared my PATH before running the shell (I left only Windows and System32)
- my installation is not up-to-date
- I do not have msys2 libtool, automake nor binutils; if the build used any of those, they came from mingw64 or from the host ghc
- I had to run boot in pure msys2 shell, because mingw64 perl caused it to fail
I saw no segfaults, but I may have missed them. I did not get a ghc.exe, but that may be correct behavior for all I know. My simple test program compiled and ran fine. I saw a lot of warnings during ghc's build though:
- checking for DocBook DTD... I/O error : Attempt to load network entity http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd
- something about not finding any implementation of a module out of [ xxx.dylib, xxx, ... ], I think it was in cabal builds
- "could not find link destinations for: xxx"
I hope it helps somehow. Maybe your issues come from mixing the msys2 and mingw toolchains after all.
-- David Macek

Hi David,
ghc should appear as inplace/bin/ghc-stage2.exe after a successful build.
The warnings are expected.
Did you run make with parallelism? I don't have a smoking gun, but the
build seems to be somewhat stable with -j1, while it crashes a lot of the
time with -j5 (I have a 4-core CPU). I have only tried a couple of runs
with -j1 (takes a while...), so I can't say for sure that non-parallel
builds are stable, but 2/2 runs succeeded.
Another data point: I ran the validate script in a loop and stored the
logs, and most crashes seem to be in rts/, but not all of them. Not sure
why.
$ grep Segmentation *.log
1.log:make[1]: *** [libraries/base/dist-install/build/Text/Show/Functions.o] Segmentation fault
10.log:make[1]: *** [rts/dist/build/Hpc.o] Segmentation fault
11.log:make[1]: *** [rts/dist/build/RtsFlags.thr_l_o] Segmentation fault
12.log:make[1]: *** [rts/dist/build/sm/GCAux.o] Segmentation fault
13.log:make[1]: *** [rts/dist/build/win32/GetEnv.thr_l_o] Segmentation fault
14.log:make[1]: *** [rts/dist/build/sm/Scav.l_o] Segmentation fault
15.log:make[1]: *** [compiler/stage1/build/RegAlloc/Linear/State.o] Segmentation fault
18.log:make[1]: *** [libraries/filepath/dist-install/build/.depend-v.haskell] Segmentation fault
19.log:make[1]: *** [libraries/base/dist-install/build/.depend-v.haskell] Segmentation fault
4.log:make[1]: *** [rts/dist/build/RtsDllMain.o] Segmentation fault
5.log:make[1]: *** [rts/dist/build/sm/Evac_thr.thr_o] Segmentation fault
6.log:make[1]: *** [rts/dist/build/sm/Scav_thr.thr_l_o] Segmentation fault
7.log:make[1]: *** [rts/dist/build/Linker.thr_debug_o] Segmentation fault
8.log:make[1]: *** [rts/dist/build/sm/Storage.debug_o] Segmentation fault
9.log:make[1]: *** [rts/dist/build/hooks/OutOfHeap.thr_debug_o] Segmentation fault
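The observation that most crashes are in rts/ can be checked mechanically. The following is a minimal sketch, not something used in this thread: it extracts the failing make targets from grep output like the listing above and counts them by top-level directory. The function names are made up for illustration, and the pattern tolerates targets that were wrapped onto a second line.

```python
import re
from collections import Counter

# Matches make error lines such as:
#   10.log:make[1]: *** [rts/dist/build/Hpc.o] Segmentation fault
# \s+ also accepts a line break, so wrapped lines still match.
SEGFAULT_RE = re.compile(r"\*\*\*\s+\[([^\]]+)\]\s+Segmentation fault")


def segfault_targets(text):
    """Return the make targets that are reported as segfaulting in `text`."""
    return SEGFAULT_RE.findall(text)


def tally_by_top_dir(targets):
    """Count targets by first path component (rts, libraries, compiler, ...)."""
    return Counter(target.split("/", 1)[0] for target in targets)


# Example use: feed it the concatenated contents of the stored logs, e.g.
#   text = "".join(open(p).read() for p in glob.glob("*.log"))
#   print(tally_by_top_dir(segfault_targets(text)).most_common())
```

Applied to the 15 lines listed above, this counts 11 targets under rts, 3 under libraries and 1 under compiler.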
_______________________________________________
Msys2-users mailing list
Msys2-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/msys2-users
-- Gintautas Miliauskas

On 4. 11. 2014 23:20, Gintautas Miliauskas wrote:
> ghc should appear as inplace/bin/ghc-stage2.exe after a successful build.
It's there.
> Did you run make with parallelism? I don't have a smoking gun, but the build seems to be somewhat stable with -j1, while it crashes a lot of the time with -j5 (I have a 4-core CPU). I have only tried a couple of runs with -j1 (takes a while...), so I can't say for sure that non-parallel builds are stable, but 2/2 runs succeeded.
Nope. I'll try with -j5.
-- David Macek

Oh, and David, thanks for your help. It's really appreciated. This issue
has been driving me nuts recently, and I don't have a good strategy to root
it out...
-- Gintautas Miliauskas

On 4. 11. 2014 23:23, David Macek wrote:
> Nope. I'll try with -j5.
So that looks like another successful build. Unless "make" can ignore the -j argument, I'd say the issue is caused or activated by your configuration.
I'm running validate to double check (detected 4 CPUs).
Maybe we should work out a precise, minimal recipe to replicate the issue (I haven't tried installing a clean MSYS2 yet).
By the way, have you ruled out anti-virus software (and other BLODApps) as a possible cause?
-- David Macek

Sorry for the large number of messages.
On 5. 11. 2014 8:01, David Macek wrote:
> I'm running validate to double check (detected 4 CPUs).
I got the validate results:
Unexpected results from:
TEST="linker_unload listCommand002 T5681 T5486 T7571 ghcpkg05 T3924 T7702 plugins01 T6106 ghci038 T8172 ghci032 T5975a T5975b ghci058 T3064 T3307 environment001 T876 T3738 T4830 T5205 T7436 lazy-bs-alloc T1407 rdynamic T7037 T5423 T8124 T5435_dyn_asm prog012 prog013 prog001 prog002 prog003 T4006"
OVERALL SUMMARY for test run started at 11/05/14 09:04:01 Central Europe Standard Time
 1:01:50 spent to go through
    4095 total tests, which gave rise to
   14911 test cases, of which
   11167 were skipped
      58 had missing libraries
    3578 expected passes
      71 expected failures
       1 caused framework failures
       1 unexpected passes
      36 unexpected failures
Unexpected passes:
   rts  linker_unload (normal)
Unexpected failures:
   ../../libraries/base/tests     T4006 [bad stdout] (normal)
   ../../libraries/base/tests/IO  T3307 [bad exit code] (normal)
   ../../libraries/base/tests/IO  environment001 [bad stdout] (normal)
   cabal                          ghcpkg05 [bad stderr] (normal)
   callarity/perf                 T3924 [stat too good] (normal)
   ghci.debugger/scripts          listCommand002 [bad stderr] (ghci)
   ghci/linking                   T1407 [bad stderr] (ghci)
   ghci/prog001                   prog001 [bad stderr] (ghci)
   ghci/prog002                   prog002 [bad stderr] (ghci)
   ghci/prog003                   prog003 [bad exit code] (ghci)
   ghci/prog012                   prog012 [bad stderr] (ghci)
   ghci/prog013                   prog013 [bad stderr] (ghci)
   ghci/scripts                   T5975a [bad stderr] (ghci)
   ghci/scripts                   T5975b [bad stderr] (ghci)
   ghci/scripts                   T6106 [bad stderr] (ghci)
   ghci/scripts                   T8172 [bad stdout] (ghci)
   ghci/scripts                   ghci032 [bad stderr] (ghci)
   ghci/scripts                   ghci038 [bad stderr] (ghci)
   ghci/scripts                   ghci058 [bad stderr] (ghci)
   llvm/should_compile            T5486 [stderr mismatch] (optllvm)
   llvm/should_compile            T5681 [stderr mismatch] (optllvm)
   llvm/should_compile            T7571 [stderr mismatch] (optllvm)
   perf/compiler                  T3064 [stat not good enough] (normal)
   perf/should_run                T3738 [stat too good] (normal)
   perf/should_run                T4830 [stat too good] (normal)
   perf/should_run                T5205 [stat too good] (normal)
   perf/should_run                T7436 [stat too good] (normal)
   perf/should_run                T876 [stat not good enough] (normal)
   perf/should_run                lazy-bs-alloc [stat not good enough] (normal)
   plugins                        plugins01 [bad stderr] (normal)
   rts                            T5423 [bad stdout] (normal)
   rts                            T5435_dyn_asm [bad stdout] (normal)
   rts                            T7037 [bad stdout] (normal)
   rts                            T8124 [exit code non-0] (threaded1)
   rts                            rdynamic [bad exit code] (normal)
   simplCore/should_compile       T7702 [stderr mismatch] (normal)
I assume that means the build itself had no errors.
-- David Macek

Hi David,
more messages means more progress on this issue, so fire away.
Your validate output looks reasonable; some tests are known to fail on Windows.
I just verified that, at least here, the build seems to be stable with -j1: 40/40 builds were successful. Great! Now we know that something is up with parallel make.
Thanks for pointing out that virus scanners could be an issue. I found that Microsoft Security Essentials realtime scanning was on. I'll try disabling it and see if that helps with the -j5 case.
> So that looks like another successful build. Unless "make" can ignore the -j argument, I'd say the issue is caused or activated by your configuration.
I would be very happy if that were the case, but there were reports about instabilities on msys2 from others too. Herbert wrote:
> PS: Fwiw, It seems to me the Cygwin environment seems more reliable compared to the Msys2 environment. On Cygwin I never had any aborted GHC builds, while on Msys2 it seems to happen from time to time (but non-deterministic, and rather seldom)
Herbert, were you running make with -jN, N > 1?
-- Gintautas Miliauskas

> Thanks for pointing out that virus scanners could be an issue. I found that Microsoft Security Essentials realtime scanning was on. I'll try disabling it and see if that helps with the -j5 case.
For what it's worth, I tried disabling the virus scanner, but it did not help, 4/8 validation runs segfaulted (-j5).
-- Gintautas Miliauskas
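As a rough sanity check on the counts reported so far (2/2 and later 40/40 successful -j1 builds, against 4/8 crashed -j5 validate runs), here is a small sketch that is not from the thread. It assumes that runs are independent and, purely for the sake of argument, that -j1 builds crash at the same per-run rate as the -j5 runs; the function name is made up for illustration.

```python
def prob_all_succeed(crash_rate, runs):
    """Probability that `runs` independent runs all succeed,
    given a per-run crash probability of `crash_rate`."""
    return (1.0 - crash_rate) ** runs


# Crash rate observed with -j5: 4 of 8 validate runs segfaulted.
observed_rate = 4 / 8

# Chance of 2/2 clean runs at that rate: 0.25, so 2/2 proves very little.
two_clean = prob_all_succeed(observed_rate, 2)

# Chance of 40/40 clean runs at that rate: 0.5 ** 40, about 9.1e-13.
forty_clean = prob_all_succeed(observed_rate, 40)
```

Under those assumptions, 40 clean runs in a row would be very unlikely if -j1 were as crash-prone as -j5, which is consistent with the suspicion about parallel builds. The -j1 figure counts builds and the -j5 figure counts validate runs, so the comparison is only indicative.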

On 5. 11. 2014 18:13, Gintautas Miliauskas wrote:
>> Thanks for pointing out that virus scanners could be an issue. I found that Microsoft Security Essentials realtime scanning was on. I'll try disabling it and see if that helps with the -j5 case.
> For what it's worth, I tried disabling the virus scanner, but it did not help, 4/8 validation runs segfaulted (-j5).
Can you dump your package versions here? Use pacman -Qe. I want to try the build with a replica of your environment.
Also, does 4/8 mean that some builds were without errors? Maybe I haven't done enough runs. Could you attach the script you use for running validate in a loop? (I'm sure it's simple enough for me to write it, but if I can avoid it...)
-- David Macek
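The script requested here does not appear in the thread. As a stand-in, the following is a hypothetical sketch of such a loop: it runs a command repeatedly, stores each run's combined output in a numbered log file (matching the 1.log, 2.log, ... naming seen in the earlier grep output), and reports which runs mention a segmentation fault. The function name, the log directory and the way validate is invoked are all assumptions.

```python
import subprocess
from pathlib import Path


def run_repeatedly(command, runs, log_dir):
    """Run `command` `runs` times, writing stdout+stderr to <log_dir>/<n>.log.

    Returns the run numbers whose log contains "Segmentation fault".
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    crashed = []
    for n in range(1, runs + 1):
        log_path = log_dir / f"{n}.log"
        with open(log_path, "wb") as log:
            # The exit code is deliberately ignored; only the log is inspected.
            subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        if b"Segmentation fault" in log_path.read_bytes():
            crashed.append(n)
    return crashed


# Hypothetical use from the top of a GHC checkout in an MSYS2 shell
# (assumes ./validate is a shell script; parallelism is left to validate):
#   crashed = run_repeatedly(["sh", "./validate"], runs=8, log_dir="validate-logs")
#   print(f"{len(crashed)}/8 runs segfaulted: {crashed}")
```

Keeping one log per run makes it possible to answer the "does 4/8 mean that some builds were without errors" question directly from the returned list.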

Hey,
I'm on vacation right now without access to my workstation; I will be back
in a couple of weeks.
I believe hvr@ was having some stability issues too, maybe he can help
reproduce the problem?
Thanks for looking into this.
--
Gintautas Miliauskas

> I think I have an idea of what's going wrong here. hvr@ was right in pointing out that we need to be careful with the PATH. It seems that the bundled gcc is picking up the system-wide DLLs, and bad things happen because of version incompatibilities. That does not explain why "rm" is crashing, but maybe that's fallout from cross-process damage. I will do some more testing. If this is indeed the cause, then hopefully debugging will not be needed anyway.
Update: even after putting the embedded gcc path first in PATH, to make sure that the right DLLs are picked up, I still got a few segfaults, so this is probably not it.
-- Gintautas Miliauskas
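One way to see which DLLs could be affected by PATH ordering is to list every DLL name that is provided by more than one PATH directory. The sketch below is an illustration under assumptions, not a tool from this thread; the function names are made up. Note that on Windows the directory of the executable and the system directories are normally searched before PATH, so PATH order is only part of the picture.

```python
import os
from pathlib import Path


def dll_providers(path_value, sep=os.pathsep):
    """Map each DLL file name (lower-cased) to the PATH entries that
    contain it, in PATH order. Repeated and missing entries are skipped."""
    providers = {}
    seen = set()
    for entry in path_value.split(sep):
        if not entry or entry in seen:
            continue
        seen.add(entry)
        directory = Path(entry)
        if not directory.is_dir():
            continue
        for item in directory.iterdir():
            if item.suffix.lower() == ".dll" and item.is_file():
                providers.setdefault(item.name.lower(), []).append(entry)
    return providers


def shadowed_dlls(path_value, sep=os.pathsep):
    """Return only the DLLs found in more than one PATH entry."""
    return {
        name: dirs
        for name, dirs in dll_providers(path_value, sep).items()
        if len(dirs) > 1
    }


# Example use: shadowed_dlls(os.environ["PATH"])
```

For each name in the result, the first directory in the list is the copy a plain PATH walk would find first; a mismatch between that and the copy shipped with the bundled gcc would be worth a closer look.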

On 04/11/2014 15:44, Gintautas Miliauskas wrote:
>> I think I have an idea of what's going wrong here. hvr@ was right in pointing out that we need to be careful with the PATH. It seems that the bundled gcc is picking up the system-wide DLLs, and bad things happen because of version incompatibilities. That does not explain why "rm" is crashing, but maybe that's fallout from cross-process damage. I will do some more testing. If this is indeed the cause, then hopefully debugging will not be needed anyway.
> Update: even after setting PATH to have the embedded gcc path in the first position to make sure that the right DLLs are picked up, I still got a few segfaults, so this is probably not it.
Is it always GHC that segfaults, or one of the other programs it invokes, like gcc?
Have you tried to reproduce this on another machine, to rule out hardware problems?
Cheers, Simon
participants (3)
- David Macek
- Gintautas Miliauskas
- Simon Marlow