Analyzing slow performance of a Haskell program

Hello Cafe,

I was trying to solve ITA Software's "Word Numbers" puzzle (http://www.itasoftware.com/careers/puzzle_archive.html) using a brute-force approach. Here's a quick recap of the problem:

A wordNumber is defined as

{{{
wordNumber 1  = "one"
wordNumber 2  = "onetwo"
wordNumber 3  = "onetwothree"
wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
...
}}}

Problem: Find the 51-billion-th letter of (wordNumber Infinity); assume that letter is found at 'wordNumber x', also find 'sum [1..x]'.

From an imperative perspective, a naive algorithm would be to have 2 counters, keep counting the length of each wordNumber, and "break" to return the result.

The imperative brute-force approach is implemented in C# here: http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my computer. Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq. It cannot finish the calculation in 5 minutes on my machine. (Irony: it has more lines than the C# version.)

Here is the `-p` profile of the Haskell program: http://hpaste.org/49934

The question: are there obvious mistakes I am making?

(Note: I am fully aware that brute-forcing it is not the correct solution to this problem. I am mainly interested in making the Haskell version perform comparably to the C# version. Right now it is at least 5x slower, so obviously I am missing something obvious.)

(Note 2: It does not seem to be space leaking. The program runs with constant memory (about 2MB) on my computer.)

(Note 3: I am compiling with `ghc -O2 WordNumber.hs`.)

Chris
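[For orientation, a minimal sketch of the naive counting idea described above. The names and structure are illustrative only, not the actual ideone submissions, and the real 51-billion target is far too slow to reach this way.]

{{{
{-# LANGUAGE BangPatterns #-}
-- Spell out each number (no spaces, no "and", matching the examples above),
-- keep a running letter count, and stop once the target position falls
-- inside the current word.
import Data.Int (Int64)

spell :: Int64 -> String
spell n
  | n < 20      = small !! fromIntegral n
  | n < 100     = tens !! fromIntegral (n `quot` 10) ++ spell (n `rem` 10)
  | n < 1000    = spell (n `quot` 100) ++ "hundred" ++ spell (n `rem` 100)
  | n < 1000000 = spell (n `quot` 1000) ++ "thousand" ++ spell (n `rem` 1000)
  | otherwise   = spell (n `quot` 1000000) ++ "million" ++ spell (n `rem` 1000000)
  where
    small = ["","one","two","three","four","five","six","seven","eight","nine"
            ,"ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen"
            ,"seventeen","eighteen","nineteen"]
    tens  = ["","","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]

-- Two counters: the current number n and the letters seen so far.
-- Returns (x, sum [1..x]) where the target letter falls inside wordNumber x.
solveNaive :: Int64 -> (Int64, Int64)
solveNaive target = go 1 0 0
  where
    go !n !letters !total
      | letters' >= target = (n, total + n)
      | otherwise          = go (n + 1) letters' (total + n)
      where letters' = letters + fromIntegral (length (spell n))

main :: IO ()
main = print (solveNaive 1000)  -- the real target, 51000000000, is impractically slow this way
}}}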

On 7 August 2011 06:15, Chris Yuen
I am mainly interested in making the Haskell version perform comparably to the C# version. Right now it is at least 5x slower, so obviously I am missing something obvious.
You have a "map" call which is immediately consumed by "solve". GHCs fusion wont' help you because "solve" is not defined in terms of foldr. Fusing this manually (http://hpaste.org/49936) you can get 10% improvement. Another source of problems is the divMod calls. There are two issues: 1. The result of the divMod is a pair of boxed Int64s. This can be worked around by using div and mod seperately instead, but that is actually slower even though it avoids the boxing. 2. The divMod is "checked": i.e. it throws a Haskell exception if the first argument is minBound or the second is 0. This means that divMod does two equality checks and one unboxing operation (which will just be an always-taken branch, thanks to pointer tagging) before it actually reaches GHC.Base.divInt# If I use divInt# and modInt# directly like so: {{{ wordLength' :: Int64 -> Int64 -> Int64 wordLength' !pad !n@(I64# n#) | n < 10 = lenOnes n + pad | n < 20 = lenTeens (n-10) + pad | n < 100 = splitterTen | n < 1000 = splitter 100 7 | n < 1000000 = splitter 1000 8 | otherwise = splitter 1000000 7 where splitterTen = let -- !(!t, !x) = n `divMod` 10 t = n# `divInt#` 10# x = n# `modInt#` 10# in wordLength' (lenTens (I64# t) + pad) (I64# x) splitter !(I# d#) !suffix = let -- !(!t, !x) = n `divMod` d t = n# `divInt#` d# x = n# `modInt#` d# in wordLength' (wordLength' (suffix+pad) (I64# t)) (I64# x) }}} We sacrifice these checks but the code gets 25% faster again. I can't see anything else obviously wrong with the core, so the remaining issues are likely to be things like loop unrolling, turning div by a constant int divisor into a multiply and other code generation issues. I tried -fllvm but it has no effect. At a guess, this is because optimisations are impeded by the call to stg_gc_fun in the stack check that solve makes. In short I don't see how to get further without changing the algorithm or doing some hacks like manual unrolling. Maybe someone else has some ideas? Max

On Sunday 07 August 2011, 10:52:20, Max Bolingbroke wrote:
In short I don't see how to get further without changing the algorithm or doing some hacks like manual unrolling. Maybe someone else has some ideas?
Well, the C# implementation uses arrays for lookup while the Haskell version uses list lookups in

{{{
(tens !! fromIntegral t) ++ wordify x
}}}

and case'd functions

{{{
lenTens 0 = 0
lenTens 1 = 3
lenTens 2 = 6
lenTens 3 = 6
lenTens 4 = 5
lenTens 5 = 5
lenTens 6 = 5
lenTens 7 = 7
lenTens 8 = 6
lenTens 9 = 6
}}}

wordify is only called once at the end, so that should not have a measurable impact, but the lenXXXs might. I'm not sure what

{{{
CaseLen.$wlenTens :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Str=DmdType L,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=1, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [12] 11 0}]
CaseLen.$wlenTens =
  \ (ww_shY :: GHC.Prim.Int#) ->
    case ww_shY of _ {
      __DEFAULT -> CaseLen.lenTens1
                   `cast` (CoUnsafe GHC.Types.Int GHC.Prim.Int#
                           :: GHC.Types.Int ~ GHC.Prim.Int#);
      0 -> 0; 1 -> 3; 2 -> 6; 3 -> 6; 4 -> 5;
      5 -> 5; 6 -> 5; 7 -> 7; 8 -> 6; 9 -> 6
    }
}}}

means at a lower level, but it's certainly worth trying out whether an unboxed array lookup is faster.
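[For concreteness, here is roughly what the unboxed-array variant looks like; the names are assumed for illustration and Chris posts his actual version in the next message. The letter counts are the same tables quoted above.]

{{{
import Data.Array.Unboxed (UArray, listArray, (!))

-- letter counts of "", "one" .. "nine"; "ten", "twenty" .. "ninety";
-- and "ten", "eleven" .. "nineteen"
lenOnes, lenTens, lenTeens :: UArray Int Int
lenOnes  = listArray (0, 9) [0, 3, 3, 5, 4, 4, 3, 5, 5, 4]
lenTens  = listArray (0, 9) [0, 3, 6, 6, 5, 5, 5, 7, 6, 6]
lenTeens = listArray (0, 9) [3, 6, 6, 8, 8, 7, 7, 9, 8, 8]

-- a lookup is then a read from a flat unboxed array rather than a
-- case dispatch or a list traversal
tensLen :: Int -> Int
tensLen d = lenTens ! d
}}}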

Here is an updated version using Data.Array.Unboxed: http://ideone.com/YXuVL
And the profile: http://hpaste.org/49940

Still taking 5+ minutes...

Chris

What about using unsafe array indexing operations? (i.e. array `unsafeAt` index)
--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/

Where is the `unsafeAt` function? I can't seem to find it (http://haskell.org/hoogle/?hoogle=unsafeat).

For reference, I have asked the same question on StackOverflow. One person suggested that the reason might be that Int64 on Windows is broken (http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-h...).

I tried the same test on Arch Linux x64 (GHC 7.0.3) but it still can't complete in 3 minutes, whereas a new C++ version I wrote completes in 45 seconds (I didn't want to use Mono for benchmarks). For reference, here is the C++ implementation: http://ideone.com/vZGhh (again, ironically shorter than the Haskell and it actually looks quite clean).

The profile under x64 Linux is similar to the one posted before -- most allocations and time are spent in wordLength'.

It seems mysterious that such an innocent program is so hard to write "correctly" in Haskell :P

Chris
On Mon, Aug 8, 2011 at 1:40 AM, Eugene Kirpichov
What about using unsafe array indexing operations? (i.e. array `unsafeAt` index)

On Monday 08 August 2011, 18:24:45, Chris Yuen wrote:
Where is the `unsafeAt` function?
Data.Array.Base
I can't seem to find it ( http://haskell.org/hoogle/?hoogle=unsafeat).
Data.Array.Base is not haddocked (there's a reason for that), so hoogle doesn't know about its functions.
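[For anyone following along, a minimal usage sketch, with the table values copied from earlier in the thread: unsafeAt takes the zero-based offset into the array and performs no bounds check, so the index must already be known to be in range.]

{{{
import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray, listArray)

lenTens :: UArray Int Int
lenTens = listArray (0, 9) [0, 3, 6, 6, 5, 5, 5, 7, 6, 6]

-- With bounds (0, 9) the zero-based offset coincides with the index,
-- so this reads the same element as lenTens ! d, minus the bounds check.
tensLen :: Int -> Int
tensLen d = lenTens `unsafeAt` d
}}}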
For reference I have asked the same question on StackOverflow. One person suggested that the reason might be that Int64 on Windows is broken ( http://stackoverflow.com/questions/6970904/analyzing-slow-performance-o f-a-haskell-program/6976448#6976448 ).
As far as I know, there's no 64-bit GHC for Windows, so that might well have a performance impact even on 64-bit Windows.

On Mon, Aug 8, 2011 at 9:24 AM, Chris Yuen
For reference I have asked the same question on StackOverflow. One person suggested that the reason might be that Int64 on Windows is broken ( http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-h... ).
No, they're barking up the wrong tree.

I've put an idiomatic Haskell translation of your C++ algorithm at https://gist.github.com/1133048#file_wordy.hs

(I've also included a copy of your original C++, with a bug fixed, in the same gist.)

As you can see, the two are almost identical. Not surprisingly, each one spends the bulk of its time computing word lengths.

GHC simply doesn't do a great job of compiling fairly tight code like this. gcc generates about 100 lines of assembly that's mostly easy to follow (except for some bit-twiddling tricks to avoid div instructions). Although the Core it generates looks fine, GHC spends quite a bit of time in its generated assembly on what looks to me like STG housekeeping (it spends only 0.3% of its time in the garbage collector, because it doesn't allocate memory). The overall result is that the Haskell code runs about 5x more slowly than the C++ code.

On 9 August 2011 10:06, Bryan O'Sullivan
GHC simply doesn't do a great job of compiling fairly tight code like this. gcc generates about 100 lines of assembly that's mostly easy to follow (except for some bit-twiddling tricks to avoid div instructions). Although the Core it generates looks fine, GHC spends quite a bit of time in its generated assembly on what looks to me like STG housekeeping (it spends only 0.3% of its time in the garbage collector, because it doesn't allocate memory). The overall result is that the Haskell code runs about 5x more slowly than the C++ code.
GHC generating bad assembly suggests trying the llvm codegen (see http://donsbot.wordpress.com/2010/02/21/smoking-fast-haskell-code-using-ghcs...). Compiling Bryan's code with

$ ghc -O2 -fllvm Wordy.hs

it now runs only 2x slower than the C++ code.

Reiner

Hi all,

Thanks Bryan, reading your clean code was good for my Haskell health :)

I took your code and did some more research. I think I have found the answer. I have written an extensive analysis in my blog post: http://cfc.kizzx2.com/index.php/in-search-of-performance-in-haskell/ (comments are very much welcome, btw :)

Here are summaries of the key points:

- I was using GHC 32-bit. Int is 32-bit there, so I needed Int64. It turns out 64-bit operations in 32-bit programs are just darn slow. Maybe it's a Windows problem. On 64-bit Linux GHC, Int is 64 bit so everything just works. Changing Int64 to Int liberates me from many `fromIntegral`, which saved 20%.
- Changing `divMod` to `quotRem` saved another 20%.
- Using `Data.Vector.Unboxed` and `unsafeIndex` saved another 15% or so.
- Moving the "length" arrays to the `where` clause in `solve` with bang patterns on them saves some more.
This was a great learning experience! Now I have more questions :P

1. Why are bangs needed on the length arrays?

If I remove them from below, performance drops 10%. I thought `unsafeIndex` is straight in both arguments, no?

{{{
wordLength i = go i
  where
    go n
      | n < 10      = lengthOnes !! n
      | n < 20      = lengthTeens !! (n-10)
      | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
      | n < 1000    = (lengthOnes !! (n // 100)) + 7 + go (n % 100)
      | n < 1000000 = go (n // 1000) + 8 + go (n % 1000)
      | otherwise   = go (n // 1000000) + 7 + go (n % 1000000)
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens
}}}
2. Why does the single-element worker wrapper pattern (`go` functions) increase performance?

If we change wordLength to

{{{
wordLength n
  | n < 10      = lengthOnes !! n
  | n < 20      = lengthTeens !! (n-10)
  | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
  | n < 1000    = (lengthOnes !! (n // 100)) + 7 + wordLength (n % 100)
  | n < 1000000 = wordLength (n // 1000) + 8 + wordLength (n % 1000)
  | otherwise   = wordLength (n // 1000000) + 7 + wordLength (n % 1000000)
  where
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens
}}}

the performance drops by another 10%. This really surprised me. `go i` seemed obvious to me and I don't understand how it could make any difference. The full source code is available to GHC, so it shouldn't be related to a call-by-pointer problem? If this is the case, shouldn't we always wrap a "go" function around **any** recursive function?
Thanks!
Chris

Hi Chris,
On Tue, Aug 9, 2011 at 12:47 PM, Chris Yuen
1. Why are bangs needed on the length arrays?
If I remove them from below, performance drops 10%. I thought `unsafeIndex` is straight in both arguments, no?
{{{
wordLength i = go i
  where
    go n
      | n < 10      = lengthOnes !! n
      | n < 20      = lengthTeens !! (n-10)
      | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
      | n < 1000    = (lengthOnes !! (n // 100)) + 7 + go (n % 100)
      | n < 1000000 = go (n // 1000) + 8 + go (n % 1000)
      | otherwise   = go (n // 1000000) + 7 + go (n % 1000000)
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens
}}}
(It's "strict", not "straight".) The different lengths are not used in all branches and since Haskell is a lazy (or to be pendantic: non-strict) language we cannot compute them before knowing which branch will be evaluated. For example, given that we have ones = ... tens = error "Boom!" test = wordLength 0 evaluating 'test' should not cause an exception to be raised as the first (n < 10) branch is taken, but it would if lengthOnes was strict. Delaying the evaluation has some costs, namely allocating a thunk for e.g. `lengthVec ones` and later evaluate that thunk. By making the lengths strict we can evaluate them earlier and avoid some allocation and forcing of thunks.
2. Why does the single-element worker wrapper pattern (`go` functions) increase performance?
If we change wordLength to
{{{
wordLength n
  | n < 10      = lengthOnes !! n
  | n < 20      = lengthTeens !! (n-10)
  | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
  | n < 1000    = (lengthOnes !! (n // 100)) + 7 + wordLength (n % 100)
  | n < 1000000 = wordLength (n // 1000) + 8 + wordLength (n % 1000)
  | otherwise   = wordLength (n // 1000000) + 7 + wordLength (n % 1000000)
  where
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens
}}}
The performance drops by another 10%. This really surprised me. `go i` seemed obvious to me and I don't understand how it could make any difference. The full source code is available to GHC, so it shouldn't be related to a call-by-pointer problem? If this is the case, shouldn't we always wrap a "go" function around **any** recursive function?
Making wordLength non-recursive lets GHC inline it, which can sometimes help performance (e.g. if the inlining enables more optimizations). Inlining does increase code size (and sometimes allocation, if a closure has to be allocated to capture free variables), so it's not always a good idea.

Cheers,
Johan
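[A toy illustration of that point, not code from the thread: the directly recursive form below cannot be inlined at its call sites because it calls itself, while the wrapped form exposes a small non-recursive wrapper that GHC can inline, with the recursion tucked into go.]

{{{
-- Self-recursive: GHC will not inline this at its call sites.
sumTo :: Int -> Int
sumTo 0 = 0
sumTo n = n + sumTo (n - 1)

-- Non-recursive wrapper around a local worker: the wrapper can be inlined.
sumTo' :: Int -> Int
sumTo' m = go m
  where
    go 0 = 0
    go n = n + go (n - 1)
}}}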

On Tue, Aug 9, 2011 at 9:47 AM, Chris Yuen
- I was using GHC 32-bit. Int is 32-bit there, so I needed Int64. It turns out 64-bit operations in 32-bit programs are just darn slow. Maybe it's a Windows problem.
No, GHC calls out to C for 64-bit integer ops on all 32-bit platforms.
On Linux 64 bit GHC Int is 64 bit so everything just works. Changing Int64 to Int liberates me from many `fromIntegral` which saved 20%
Actually, fromIntegral is usually a no-op, so chances are you're seeing the effect of something else.
- Changing `divMod` to `quotRem` saved another 20%
It's cheaper again to use quotInt# and remInt# as I did in my code.
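[As a side note for readers, a small sketch of the three tiers being compared here; it is illustrative only, the real code is in Max's earlier mail and in the linked programs.]

{{{
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), quotInt#, remInt#)

-- divMod rounds toward negative infinity, so the library code does sign
-- fix-ups on top of the machine division (plus the zero/overflow checks).
splitDivMod :: Int -> (Int, Int)
splitDivMod n = n `divMod` 1000

-- quotRem rounds toward zero, matching the machine division directly,
-- but still checks for division by zero and the minBound/(-1) case.
splitQuotRem :: Int -> (Int, Int)
splitQuotRem n = n `quotRem` 1000

-- The primops skip the checks entirely; the caller must guarantee the
-- divisor is non-zero.
splitPrim :: Int -> (Int, Int)
splitPrim (I# n#) = (I# (n# `quotInt#` 1000#), I# (n# `remInt#` 1000#))
}}}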
1. Why are bangs needed on the length arrays?
GHC has to deconstruct the Vector in order to get at the real underlying array, so that unsafeIndex can perform the actual index into the real underlying array. Without bang patterns, the code performs at least one deconstruction on every iteration through the loop. Each deconstruction has a cost. With the bang patterns, they're all hoisted out and performed just once.
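[Putting those two answers together, the shape of the fast inner lookup is roughly this. It is a sketch with assumed table contents; `!!`, `//` and `%` in Chris's snippets are taken to be local aliases for unsafeIndex, quot and rem.]

{{{
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

lengthVec :: [String] -> U.Vector Int
lengthVec = U.fromList . map length

ones, tens, teens :: [String]
ones  = ["","one","two","three","four","five","six","seven","eight","nine"]
tens  = ["","ten","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]
teens = ["ten","eleven","twelve","thirteen","fourteen","fifteen"
        ,"sixteen","seventeen","eighteen","nineteen"]

wordLength :: Int -> Int
wordLength i = go i
  where
    go n
      | n < 10      = lengthOnes  `U.unsafeIndex` n
      | n < 20      = lengthTeens `U.unsafeIndex` (n - 10)
      | n < 100     = lengthTens `U.unsafeIndex` (n `quot` 10)
                      + lengthOnes `U.unsafeIndex` (n `rem` 10)
      | n < 1000    = lengthOnes `U.unsafeIndex` (n `quot` 100) + 7 + go (n `rem` 100)
      | n < 1000000 = go (n `quot` 1000) + 8 + go (n `rem` 1000)
      | otherwise   = go (n `quot` 1000000) + 7 + go (n `rem` 1000000)
    -- the bangs force the three vectors once, here, so the loop body only
    -- does the unchecked index instead of re-deconstructing them each call
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens
}}}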

It's cheaper again to use quotInt# and remInt# as I did in my code.
Just want to point out that this actually had less of an impact than I expected. As I did the step-by-step breakdown (you can see my blog post), this one (changing `quotRem` to `quotInt#` and `remInt#`) yielded almost negligible differences. Every other change had more impact than this, including turning the `Data.Array.Unboxed` lookups into "lookup functions" (like `lenOnes 0 = 0; lenOnes 1 = 3; lenOnes 2 = 3; ...`).

Chris
participants (7):
- Bryan O'Sullivan
- Chris Yuen
- Daniel Fischer
- Eugene Kirpichov
- Johan Tibell
- Max Bolingbroke
- Reiner Pope