

bartek:
Hi Everybody,
while working on my recent project I've noticed that my code seems to be faster under Windows than under Linux x64. More precisely, it was an AI game evaluator that ran on given parameters. There was no IO performed. I ran three lots of tests on both systems and stored some figures. It was physically the same PC.
1st lot:
WinXP: total time = 27.18 secs (1359 ticks @ 20 ms); total alloc = 5,788,242,604 bytes (excludes profiling overheads)
Linux: total time = 34.44 secs (1722 ticks @ 20 ms); total alloc = 11,897,757,176 bytes (excludes profiling overheads)
2nd lot:
WinXP: total time = 63.96 secs (3198 ticks @ 20 ms); total alloc = 13,205,507,148 bytes (excludes profiling overheads)
Linux: total time = 80.76 secs (4038 ticks @ 20 ms); total alloc = 27,258,694,888 bytes (excludes profiling overheads)
3rd lot:
WinXP: total time = 207.10 secs (10355 ticks @ 20 ms); total alloc = 44,982,716,780 bytes (excludes profiling overheads)
Linux: total time = 267.58 secs (13379 ticks @ 20 ms); total alloc = 92,307,482,416 bytes (excludes profiling overheads)
I used the same compile and runtime options for both. I also tried running with the -H option, but this didn't improve anything. Is this common behaviour? Does anybody know what the reason might be? Regards, Bartek
Is Windows running in 32 bit? What gcc versions are you using on each system? -- Don

Hello Don, Tuesday, November 25, 2008, 1:59:02 AM, you wrote:
Is Windows running in 32 bit? What gcc versions are you using on each system?
There is no 64-bit GHC for Windows yet, and I think 64-bit Windows runs 32-bit programs as fast as 32-bit Windows does. So this problem naturally splits into two parts: 32-bit vs 64-bit, and Linux vs Windows. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Monday 24 November 2008 23:59:02 Don Stewart wrote:
bartek:
Hi Everybody,
while working on my recent project I've noticed that my code seems to be faster under Windows than under Linux x64.
Is Windows running in 32 bit? What gcc versions are you using on each system?
Windows is 32 bit with GHC-6.8.3. Linux is 64 bit with GHC-6.10.1. Bartek

Bartosz Wójcik
while working on my recent project I've noticed that my code seems to be faster under Windows than under Linux x64.
Is Windows running in 32 bit? What gcc versions are you using on each system?
Windows is 32 bit with GHC-6.8.3. Linux is 64 bit with GHC-6.10.1.
This corresponds to my experiences - 64 bits is slower, something I've ascribed to the cost of increased pointer size. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Tue, Nov 25, 2008 at 09:39:35PM +0100, Ketil Malde wrote:
This corresponds to my experiences - 64 bits is slower, something I've ascribed to the cost of increased pointer size.
GHC unfortunately also uses 64-bit integers when in 64-bit mode, so the cost paid increases due to that as well. Also, each arithmetic instruction needs an extra prefix byte telling it to work on 64-bit data, so the code is less dense. John -- John Meacham - ⑆repetae.net⑆john⑈
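A quick sanity check of the word-size point (an illustrative snippet, not from the original thread): compiled with a 64-bit GHC it prints 8, with a 32-bit GHC it prints 4.

    import Foreign.Storable (sizeOf)

    -- Prints the size of a machine Int in bytes: 8 on a 64-bit GHC,
    -- 4 on a 32-bit GHC.
    main :: IO ()
    main = print (sizeOf (undefined :: Int))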

On Wednesday 26 November 2008 02:16:26 John Meacham wrote:
On Tue, Nov 25, 2008 at 09:39:35PM +0100, Ketil Malde wrote:
This corresponds to my experiences - 64 bits is slower, something I've ascribed to the cost of increased pointer size.
GHC unfortunately also uses 64-bit integers when in 64-bit mode, so the cost paid increases due to that as well. Also, each arithmetic instruction needs an extra prefix byte telling it to work on 64-bit data, so the code is less dense.
I've done a little experiment to confirm this. I created a simple program and ran it with the +RTS -s option on a couple of different hardware and OS configurations:

    main = (putStrLn . show . head . drop 500000) prim

    divides d n = rem n d == 0

    ldf' :: (Integral a) => [a] -> a -> a
    ldf' (k:ks) n
      | divides k n = k
      | k^2 > n     = n
      | otherwise   = ldf' ks n

    prim = filter (\x -> ldf' (2:prim) x == x) [3..]

Results of the experiment:

Win32, Core2Duo 1.8GHz, 1GB RAM: 17 MB total memory in use; MUT time 56.97s (57.02s elapsed); %GC time 0.5%
Win32, Core2Duo 2.2GHz, 2GB RAM: 17 MB total memory in use; MUT time 57.44s (57.53s elapsed); %GC time 0.7% (0.8% elapsed)
Win32, P4 2.8GHz, 1GB RAM: 17 MB total memory in use; MUT time 171.64s (175.78s elapsed); %GC time 1.7% (1.5% elapsed)
Linux64, Core2Duo 2.2GHz, 2GB RAM: 41 MB total memory in use (1 MB lost due to fragmentation); MUT time 68.26s (68.92s elapsed); %GC time 0.9% (1.1% elapsed)
Linux32, Core2Duo 2.3GHz, 4GB RAM: 17 MB total memory in use; MUT time 51.77s (51.83s elapsed); %GC time 0.5% (0.6% elapsed)

The experiment confirms your explanation. It is also interesting how slow the P4 is in comparison to the C2D. Best and thanks. Bartek
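For what it's worth, as written prim's element type defaults to Integer under the monomorphism restriction. A variant with explicit Int signatures (a hypothetical tweak, not one of the configurations measured above) would exercise the native word size directly, which is the effect under discussion:

    -- Hypothetical variant: pin everything to machine Int (64-bit on x86_64,
    -- 32-bit on x86) instead of letting it default to Integer.
    divides :: Int -> Int -> Bool
    divides d n = rem n d == 0

    ldf' :: [Int] -> Int -> Int
    ldf' (k:ks) n
      | divides k n = k
      | k * k > n   = n
      | otherwise   = ldf' ks n
    ldf' [] n = n  -- unreachable on an infinite list, but keeps the match total

    prim :: [Int]
    prim = filter (\x -> ldf' (2 : prim) x == x) [3 ..]

    main :: IO ()
    main = print (prim !! 500000)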

Bartosz Wójcik
Win32, Core2Duo 1.8GHz, 1GB RAM: 17 MB total memory in use; MUT time 56.97s (57.02s elapsed); %GC time 0.5%
Win32, Core2Duo 2.2GHz, 2GB RAM: 17 MB total memory in use; MUT time 57.44s (57.53s elapsed); %GC time 0.7% (0.8% elapsed)
So, despite the CPU being 25% faster, it's exactly as fast. Memory bound?
Win32, P4 2.8GHz, 1GB RAM: 17 MB total memory in use; MUT time 171.64s (175.78s elapsed); %GC time 1.7% (1.5% elapsed)
You're doing divisions, and I seem to remember division being an operation that wreaked havoc with the P4's ALU or trace cache, or something like that.
Linux64, Core2Duo 2.2GHz, 2GB RAM: 41 MB total memory in use (1 MB lost due to fragmentation); MUT time 68.26s (68.92s elapsed); %GC time 0.9% (1.1% elapsed)
Linux32, Core2Duo 2.3GHz, 4GB RAM: 17 MB total memory in use; MUT time 51.77s (51.83s elapsed); %GC time 0.5% (0.6% elapsed)
Interesting that Linux32 is actually faster than Win32. Different cache sizes? -k -- If I haven't seen further, it is by standing in the footprints of giants

John Meacham wrote:
On Tue, Nov 25, 2008 at 09:39:35PM +0100, Ketil Malde wrote:
This corresponds to my experiences - 64 bits is slower, something I've ascribed to the cost of increased pointer size.
GHC unfortunately also uses 64-bit integers when in 64-bit mode, so the cost paid increases due to that as well. Also, each arithmetic instruction needs an extra prefix byte telling it to work on 64-bit data, so the code is less dense.
Right - in the Java world they use tricks to keep pointers down to 32 bits on a 64-bit platform, e.g. by shifting pointers by a couple of bits (giving you access to 16Gb). There are a number of problems with doing this in GHC, though:

- We already use those low pointer bits for encoding tag information. So perhaps we could take only one bit, giving you access to 8Gb, and lose one tag bit.

- It means recompiling *everything*. It's a completely new way of compiling, so you have to make the decision to do this once and for all, or build all your libraries + RTS twice. In JITed languages they can make the choice at runtime, which makes it much easier.

- It tends to be a bit platform-specific, because you need a base address in the address space for your 16Gb of memory, and different platforms lay out the address space differently. The nice thing about GHC's memory manager is that it currently has *no* dependencies on address-space layout (except in the ELF64 dynamic linker... sigh).

- As usual with code-generation knobs, it multiplies the testing surface, which is something we're quite sensitive to (our surface is already on the verge of being larger than we can cope with given our resources).

So my current take on this is that it isn't worth it just to get access to more memory and slightly improved performance. However, perhaps we should work on making it easier to use the 32-bit GHC on 64-bit platforms - IIRC right now you have to use something like -opta-m32 -optc-m32.

Cheers, Simon
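As a rough illustration of the shift-based compression described above, here is a minimal Haskell sketch; heapBase and the 2-bit shift are assumptions for the example, and this is not how GHC actually represents pointers:

    import Data.Bits (shiftL, shiftR)
    import Data.Word (Word32, Word64)

    -- Hypothetical base address of the heap; purely illustrative.
    heapBase :: Word64
    heapBase = 0x100000000

    -- With 4-byte-aligned objects the low 2 bits of every address are zero,
    -- so a 32-bit "compressed" pointer covers 2^32 * 4 = 16 GB above heapBase.
    -- (If one of those bits is kept for pointer tagging, as in GHC, only
    -- 8 GB remain addressable, which is the trade-off mentioned above.)
    compress :: Word64 -> Word32
    compress addr = fromIntegral ((addr - heapBase) `shiftR` 2)

    decompress :: Word32 -> Word64
    decompress c = heapBase + (fromIntegral c `shiftL` 2)

    main :: IO ()
    main = print (decompress (compress 0x100000120) == 0x100000120)  -- True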

On Friday 12 December 2008 13:57:44 Simon Marlow wrote:
- it means recompiling *everything*. It's a complete new way, so you have to make the decision to do this once and for all, or build all your libraries + RTS twice. In JITed languages they can make the choice at runtime, which makes it much easier.
Is anyone developing a JIT Haskell compiler? The approach has many important practical benefits and tools like LLVM are a joy to use... -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e

Is the Windows install 32- or 64-bit? A while ago, GHC had trouble producing efficient binaries for 64-bit Intel systems; something about the interaction between gcc and the C it produced created some pessimal assembly output. I do not know how much of an issue this still is, though. You could try compiling 32-bit binaries under Linux and running them on the same machine (they will work on the 64-bit system) and compare the results. John -- John Meacham - ⑆repetae.net⑆john⑈

john:
Is the windows 32 or 64 bit, a while ago, ghc had trouble producing efficient binaries for 64 bit intel systems. Something about the interaction between gcc and the C it produced created some pessimal assembly output. I do not know how much this is still an issue though. You could try compiling 32 bit binaries under linux and running them on the same machine (they will work on the 64 bit system) and compare the results.
I get better code on x86_64/linux than on x86/linux, fwiw, thanks to my trusty gcc version 4.3.2.
participants (7): Bartosz Wójcik, Bulat Ziganshin, Don Stewart, John Meacham, Jon Harrop, Ketil Malde, Simon Marlow