Broken ghc-7.0.3/vector combination?

Investigating the appearance of NaN in criterion's output, I found that NaNs were frequently introduced into the resample vectors when the resamples were sorted. Further investigation of the sorting code in vector-algorithms revealed no bugs there, and if the runtime was forced to keep a keen eye on the indices, by replacing unsafeRead/Write/Swap with their bounds-checked counterparts or by 'trace'ing enough of their uses, the NaNs did not appear. I could not reproduce the behaviour with ghc-7.0.1 (using exactly the same versions of the involved libraries), ghc-7.0.2 (different criterion release, the other libraries identical) or unoptimised compilation with 7.0.3 (no NaNs encountered in some 100+ test runs with varying input). So, is it possible that some change in ghc-7.0.3 vs. the previous versions caused a bad interaction between ghc-optimisations and vector fusion resulting in bad vector reads/writes?
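For readers unfamiliar with the accessors mentioned above, here is a minimal sketch of the difference between the unchecked and the bounds-checked element access on a mutable unboxed vector. The helper names swapUnsafe and swapChecked are hypothetical and are not taken from vector-algorithms; only the Data.Vector.Unboxed.Mutable calls are the library's own.

```haskell
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Hypothetical helper: swap two elements without any bounds checks,
-- the way performance-sensitive sorting code typically does.
swapUnsafe :: MU.MVector s Double -> Int -> Int -> ST s ()
swapUnsafe v i j = do
  x <- MU.unsafeRead v i
  y <- MU.unsafeRead v j
  MU.unsafeWrite v i y
  MU.unsafeWrite v j x

-- Hypothetical helper: the same swap with bounds-checked accessors.
-- An out-of-range index raises an error instead of touching stray memory.
swapChecked :: MU.MVector s Double -> Int -> Int -> ST s ()
swapChecked v i j = do
  x <- MU.read v i
  y <- MU.read v j
  MU.write v i y
  MU.write v j x

main :: IO ()
main = do
  let v = U.fromList [3, 1, 2] :: U.Vector Double
  -- Both variants should give the same result for in-range indices.
  print (U.modify (\mv -> swapUnsafe mv 0 2) v)
  print (U.modify (\mv -> swapChecked mv 0 2) v)
```

If both variants agree for in-range indices and the checked one never raises an error, out-of-bounds access becomes an unlikely explanation, which is the line of reasoning used here.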

Daniel Fischer wrote:
Further investigation of the sorting code in vector-algorithms revealed no bugs there, and if the runtime was forced to keep a keen eye on the indices, by replacing unsafeRead/Write/Swap with their bounds-checked counterparts or by 'trace'ing enough of their uses, the NaNs did not appear.
Did you replace them in vector-algorithms or in vector itself?
So, is it possible that some change in ghc-7.0.3 vs. the previous versions caused a bad interaction between ghc-optimisations and vector fusion resulting in bad vector reads/writes?
Am I right in assuming that this happens in code which uses only mutable vectors? Fusion only works for immutable ones so it shouldn't really affect things here. Have you tried playing around with code generation flags like -msse2? In any case, I would try to take a look at this if you tell me how to reproduce. Roman

On Wednesday 20 April 2011 19:11:07, Roman Leshchinskiy wrote:
Daniel Fischer wrote:
Further investigation of the sorting code in vector-algorithms revealed no bugs there, and if the runtime was forced to keep a keen eye on the indices, by replacing unsafeRead/Write/Swap with their bounds-checked counterparts or by 'trace'ing enough of their uses, the NaNs did not appear.
Did you replace them in vector-algorithms or in vector itself?
vector-algorithms only.
So, is it possible that some change in ghc-7.0.3 vs. the previous versions caused a bad interaction between ghc-optimisations and vector fusion resulting in bad vector reads/writes?
Am I right in assuming that this happens in code which uses only mutable vectors?
Yes, the sorting uses mutable vectors, in this case unboxed Double vectors.
Fusion only works for immutable ones so it shouldn't really affect things here.
Ah, didn't know that. Another suspect gone.
Have you tried playing around with code generation flags like -msse2?
No, not yet. So far only -O2 (with -fspec-constr-count=5 in the presence of many trace calls) and -O0.
In any case, I would try to take a look at this if you tell me how to reproduce.
I'll prepare a bundle, I'm afraid it won't be small, though. And it might be architecture dependent, so I can't guarantee that you will be able to reproduce it. But Bryan said on IRC yesterday that others have reported similar issues with criterion output, so it may well be cross-platform reproducible. Cheers, Daniel

On Wed, Apr 20, 2011 at 10:44 AM, Daniel Fischer <daniel.is.fischer@googlemail.com> wrote:
I'll prepare a bundle, I'm afraid it won't be small, though. And it might be architecture dependent, so I can't guarantee that you will be able to reproduce it. But Bryan said on IRC yesterday that others have reported similar issues with criterion output, so it may well be cross-platform reproducible.
Daniel, are you sure this is down to a 7.0.2/7.0.3 difference, and not perhaps due to just a bug in criterion itself?

On Wednesday 20 April 2011 20:25:34, Bryan O'Sullivan wrote:
On Wed, Apr 20, 2011 at 10:44 AM, Daniel Fischer <daniel.is.fischer@googlemail.com> wrote:
I'll prepare a bundle, I'm afraid it won't be small, though. And it might be architecture dependent, so I can't guarantee that you will be able to reproduce it. But Bryan said on IRC yesterday that others have reported similar issues with criterion output, so it may well be cross-platform reproducible.
Daniel, are you sure this is down to a 7.0.2/7.0.3 difference, and not perhaps due to just a bug in criterion itself?
I'm sure it's not criterion, because after I've found that NaNs were introduced to the resamples vectors during sorting (check the entire vectors for NaNs before and after sorting, tracing the count; before: 0, afterwards often quite a number, sometimes close to 10%), the further tests didn't involve criterion anymore. criterion is simply the most obvious place to see the NaNs show up (with 5-10% NaNs among the resamples, it won't take too long to see one pop up).
It could be a bug in statistics, but I'm pretty sure this one's not due to statistics either, since fiddling with vector-algorithms made the NaNs disappear - btw., Bryan, using the heap sort instead of introsort, I haven't found any NaNs in my tests, so temporarily switching the algorithm might cure the symptoms.
Dan Doel and I spent quite some time scrutinising the vector-algorithms code without finding an issue. Also, replacing the unsafe access with bounds-checked access (apparently) eliminated the NaNs, and 7.0.1 and 7.0.2 didn't produce any in my tests. These are further reasons to believe that none of these packages produces the behaviour, but rather something that changed between 7.0.2 and 7.0.3. However, so far in this matter my guesses as to what's responsible have been wrong, so I wouldn't be surprised if it's something entirely different.
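The check described above (count NaNs before and after sorting) and the suggested heap sort workaround can be sketched roughly as follows. This is only an illustration under assumptions: the input vector is stand-in data rather than real criterion resamples, and countNaNs, sortIntro, sortHeap and report are hypothetical helper names. The sort functions themselves come from Data.Vector.Algorithms.Intro and Data.Vector.Algorithms.Heap in vector-algorithms.

```haskell
import qualified Data.Vector.Algorithms.Heap as Heap
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as U

-- Hypothetical helper: number of NaN entries in a vector.
countNaNs :: U.Vector Double -> Int
countNaNs = U.length . U.filter isNaN

-- Sort a copy of the vector in place with introsort or heap sort.
sortIntro, sortHeap :: U.Vector Double -> U.Vector Double
sortIntro = U.modify Intro.sort
sortHeap  = U.modify Heap.sort

report :: String -> U.Vector Double -> IO ()
report label v = putStrLn (label ++ ": " ++ show (countNaNs v) ++ " NaNs")

main :: IO ()
main = do
  let n = 10000 :: Int
      -- Stand-in data (assumption): descending values, no NaNs.
      resamples = U.generate n (\i -> fromIntegral (n - i)) :: U.Vector Double
  report "before sorting" resamples
  report "after introsort" (sortIntro resamples)
  report "after heap sort" (sortHeap resamples)
```

On an unaffected compiler and platform all three counts should be 0; a non-zero count after introsort only would match the symptoms described in this thread.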

On Wed, Apr 20, 2011 at 3:01 PM, Daniel Fischer
I'm sure it's not criterion, because after I've found that NaNs were introduced to the resamples vectors during sorting (check the entire vectors for NaNs before and after sorting, tracing the count; before: 0, afterwards often quite a number, sometimes close to 10%), the further tests didn't involve criterion anymore. criterion is simply the most obvious place to see the NaNs show up (with 5-10% NaNs among the resamples, it won't take too long to see one pop up).
It could be a bug in statistics, but I'm pretty sure this one's not due to statistics either, since fiddling with vector-algorithms made the NaNs disappear - btw., Bryan, using the heap sort instead of introsort, I haven't found any NaNs in my tests, so temporarily switching the algorithm might cure the symptoms.
It's not a statistics bug. I'm reproducing it here using just vector-algorithms.
Fill a vector of size N with [N..1], and (intro) sort it, and you get NaNs. But only with -O or above. Without optimization it doesn't happen (and nothing seems to be reading/writing out of bounds, as I compiled vector with UnsafeChecks earlier and it didn't complain).
Filling the vector with [1..N] also doesn't trigger the NaNs. [0,0..0] and [0,0..1] trigger it.
I don't know what's going on yet. I have trouble believing it's a bug in vector-algorithms code, though, as I don't think I've written any RULEs (just INLINEs), and that's the one thing that comes to mind in library code that could cause a difference between -O0 and -O. So I'd tentatively suggest it's a vector, base or compiler bug.
The above testing is on 64-bit windows running a 32-bit copy of GHC, for reference.
My ability to investigate this will be a bit limited for the near future. If someone definitively tracks it down to bugs in my code, though, let me know, and I'll try and push a new release up on hackage.
-- Dan
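A reproduction along the lines described above might look like the sketch below. It is not the test case attached later in the thread. The size N = 10000 and the helper name nanCountAfterSort are assumptions, and the [0,0..1] input is left out because its exact layout is not spelled out.

```haskell
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as U

-- Hypothetical helper: introsort a copy of the vector and count NaNs in it.
nanCountAfterSort :: U.Vector Double -> Int
nanCountAfterSort = U.length . U.filter isNaN . U.modify Intro.sort

main :: IO ()
main = do
  let n = 10000 :: Int  -- assumed size; the thread only says "N"
      descending = U.enumFromStepN (fromIntegral n) (-1) n :: U.Vector Double
      ascending  = U.enumFromN 1 n :: U.Vector Double
      zeros      = U.replicate n 0 :: U.Vector Double
  mapM_
    (\(label, v) ->
       putStrLn (label ++ ": " ++ show (nanCountAfterSort v) ++ " NaNs after sorting"))
    [ ("descending [N..1]", descending)  -- reported to trigger NaNs
    , ("ascending [1..N]", ascending)    -- reported not to trigger NaNs
    , ("all zeros", zeros)               -- reported to trigger NaNs
    ]
```

Compile it once with -O0 and once with -O or -O2 to compare; according to the reports here, only the optimised 32-bit x86 build with ghc-7.0.3 is expected to show non-zero counts.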

On Wednesday 20 April 2011 21:55:51, Dan Doel wrote:
It's not a statistics bug. I'm reproducing it here using just vector-algorithms.
Yep. Attached is a simple test case which reproduces it and uses only vector and vector-algorithms.
Fill a vector of size N with [N..1], and (intro) sort it, and you get NaNs. But only with -O or above.
However, for me the NaNs disappear with the -msse2 option.
Without optimization it doesn't happen (and nothing seems to be reading/writing out of bounds, as I compiled vector with UnsafeChecks earlier and it didn't complain).
Nor does it happen here with 7.0.2 or 7.0.1.
Filling the vector with [1..N] also doesn't trigger the NaNs. [0,0..0] and [0,0..1] trigger it.
I don't know what's going on yet. I have trouble believing it's a bug in vector-algorithms code, though, as I don't think I've written any RULEs (just INLINEs), and that's the one thing that comes to mind in library code that could cause a difference between -O0 and -O. So I'd tentatively suggest it's a vector, base or compiler bug.
The above testing is on 64-bit windows running a 32-bit copy of GHC, for reference.
32-bit linux here
My ability to investigate this will be a bit limited for the near future. If someone definitively tracks it down to bugs in my code, though, let me know, and I'll try and push a new release up on hackage.
-- Dan

I tried "ghc --make -fforce-recomp simpleTest.hs" with -O0 and -O1 and -O2 on OS X with 64-bit ghc-7.0.3
All versions ran without printing errors.

On Thursday 21 April 2011 17:18:47, Chris Kuklewicz wrote:
I tried "ghc --make -fforce-recomp simpleTest.hs" with -O0 and -O1 and -O2 on OS X with 64-bit ghc-7.0.3
All versions ran without printing errors.
I seem to recall that GHC produces sse2 code on x86_64. If that's correct, the effect probably won't be reproducible on that architecture, since it doesn't occur with -msse2 on x86 either (well, at least on my machine).

On Thu, Apr 21, 2011 at 10:43 AM, Daniel Fischer
On Thursday 21 April 2011 17:18:47, Chris Kuklewicz wrote:
I tried "ghc --make -fforce-recomp simpleTest.hs" with -O0 and -O1 and -O2 on OS X with 64-bit ghc-7.0.3
All versions ran without printing errors.
I seem to recall that GHC produces sse2 code on x86_64. If that's correct, the effect probably won't be reproducible on that architecture, since it doesn't occur with -msse2 on x86 either (well, at least on my machine).
This is GHC 7.0.3 on Windows XP 32-bit:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
$ ls cabal-dev
bin cabal.config doc logs packages packages-7.0.3.conf primitive-0.3.1 vector-0.7.0.1 vector-algorithms-0.4
$ ./cabal-dev/bin/test.exe
After sorting: 674 NaNs.

On Wed, Apr 20, 2011 at 05:02:50PM +0200, Daniel Fischer wrote:
So, is it possible that some change in ghc-7.0.3 vs. the previous versions
Very little changed between 7.0.2 and 7.0.3. The only thing that jumps out to me as possibly being relevant is:
diff -ur 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs
--- 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 2011-02-28 18:10:06.000000000 +0000
+++ 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs 2011-03-26 18:10:04.000000000 +0000
@@ -734,6 +734,7 @@
   where p insn r = case insn of
         CALL _ _ -> GFREE : insn : r
         JMP _ -> GFREE : insn : r
+        JXX_GBL _ _ -> GFREE : insn : r
         _ -> insn : r
Thanks
Ian

On 20/04/2011 18:28, Ian Lynagh wrote:
On Wed, Apr 20, 2011 at 05:02:50PM +0200, Daniel Fischer wrote:
So, is it possible that some change in ghc-7.0.3 vs. the previous versions
Very little changed between 7.0.2 and 7.0.3. The only thing that jumps out to me as possibly being relevant is:
diff -ur 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs
--- 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 2011-02-28 18:10:06.000000000 +0000
+++ 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs 2011-03-26 18:10:04.000000000 +0000
@@ -734,6 +734,7 @@
   where p insn r = case insn of
         CALL _ _ -> GFREE : insn : r
         JMP _ -> GFREE : insn : r
+        JXX_GBL _ _ -> GFREE : insn : r
         _ -> insn : r
Right, it could be related to this. However this change was made to eliminate some causes of NaNs, see: http://hackage.haskell.org/trac/ghc/ticket/4914 So I'm very depressed if it managed to introduce NaNs somehow. Could someone make a ticket for this, with the smallest test case found so far please? Cheers, Simon

On Thu, Apr 21, 2011 at 8:08 AM, Simon Marlow
Right, it could be related to this. However this change was made to eliminate some causes of NaNs, see:
http://hackage.haskell.org/trac/ghc/ticket/4914
So I'm very depressed if it managed to introduce NaNs somehow.
Could someone make a ticket for this, with the smallest test case found so far please?
So in principle the LLVM backend should be fine? Thanks, -- Felipe.

On 21/04/2011 12:29, Felipe Almeida Lessa wrote:
On Thu, Apr 21, 2011 at 8:08 AM, Simon Marlow wrote:
Right, it could be related to this. However this change was made to eliminate some causes of NaNs, see:
http://hackage.haskell.org/trac/ghc/ticket/4914
So I'm very depressed if it managed to introduce NaNs somehow.
Could someone make a ticket for this, with the smallest test case found so far please?
So in principle the LLVM backend should be fine?
Yes, also compiling with -msse2 on x86 should be fine. Cheers, Simon
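As a rough illustration of how the suggested workaround could be applied to a single module, GHC flags can be placed in an OPTIONS_GHC pragma instead of on the command line. The program below is only a stand-in: it sorts a list with Data.List.sort, which is not the code path that showed the NaNs, and countNaNs is a hypothetical helper. It merely shows where the flag goes. Passing -msse2 (or -fllvm to select the LLVM backend) on the ghc command line should have the same effect for the whole build.

```haskell
{-# OPTIONS_GHC -O2 -msse2 #-}
-- -msse2 asks the x86 native code generator to use SSE2 instead of
-- the x87 FPU for floating point; it should make no difference on x86_64.
module Main (main) where

import Data.List (sort)

-- Hypothetical helper: number of NaN entries in a list.
countNaNs :: [Double] -> Int
countNaNs = length . filter isNaN

main :: IO ()
main = do
  -- Stand-in workload (assumption): a descending list of doubles.
  let xs = map fromIntegral [10000, 9999 .. 1 :: Int] :: [Double]
  putStrLn ("After sorting: " ++ show (countNaNs (sort xs)) ++ " NaNs.")
```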

On Thursday 21 April 2011 13:08:22, Simon Marlow wrote:
On 20/04/2011 18:28, Ian Lynagh wrote:
On Wed, Apr 20, 2011 at 05:02:50PM +0200, Daniel Fischer wrote:
So, is it possible that some change in ghc-7.0.3 vs. the previous versions
Very little changed between 7.0.2 and 7.0.3. The only thing that jumps out to me as possibly being relevant is:
diff -ur 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs
--- 7.0.2/ghc-7.0.2/compiler/nativeGen/X86/Instr.hs 2011-02-28 18:10:06.000000000 +0000
+++ 7.0.3/ghc-7.0.3/compiler/nativeGen/X86/Instr.hs 2011-03-26 18:10:04.000000000 +0000
@@ -734,6 +734,7 @@
   where p insn r = case insn of
         CALL _ _ -> GFREE : insn : r
         JMP _ -> GFREE : insn : r
+        JXX_GBL _ _ -> GFREE : insn : r
         _ -> insn : r
Right, it could be related to this.
I'm afraid it is. Comparing the dumped asm, after renaming identifiers, the only difference between the assembly produced by 7.0.2 and 7.0.3 is the appearance of 59 instances of
	ffree %st(0) ;ffree %st(1) ;ffree %st(2) ;ffree %st(3)
	ffree %st(4) ;ffree %st(5)
in 7.0.3's code which aren't in 7.0.2's.
However this change was made to eliminate some causes of NaNs, see:
http://hackage.haskell.org/trac/ghc/ticket/4914
So I'm very depressed if it managed to introduce NaNs somehow.
Could someone make a ticket for this, with the smallest test case found so far please?
http://hackage.haskell.org/trac/ghc/ticket/5149
Cheers, Simon
participants (9): Bryan O'Sullivan, Chris Kuklewicz, Dan Doel, Daniel Fischer, Felipe Almeida Lessa, Ian Lynagh, Paulo Tanimoto, Roman Leshchinskiy, Simon Marlow