
After reading Daniel Fischer's message about trying to use EnumSet in his Sudoku.hs and finding that it was slower when processing a large data set, I decided to do some profiling. I rewrote his solver to use EnumSets and profiled it. The culprit turns out to be the following function, which is responsible for 22% of the allocation in the run. Is there a more efficient way to write this function?

foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a
foldBits _ z 0  = z
foldBits f z bs = foldBits' f 0 bs z

foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a
foldBits' f i bs z
  | bs == 0   = z
  | otherwise = z' `seq` foldBits' f i' bs' z'
  where
    z' | 1 == bs .&. 1 = f z i
       | otherwise     = z
    i'  = i + 1
    bs' = bs `shiftR` 1

ps. I was impressed with how hairy DF's algorithm is, and I am not really interested enough in Sudoku to spend the time needed to grok it. So I decided to try an experiment to see if I could restructure it without understanding it very deeply.

Step 1. Comment out all the type signatures.
Step 2. Find the main place that I wanted to change from [Int] to (Set Int).
Step 3. Compile; make obvious edits; repeat until there are 0 errors.

I had it running in a few minutes. I can't imagine doing that in any other programming environment!

Cheers, David

--------------------------------
David F. Place
mailto:d@vidplace.com

On Friday, 7 April 2006 22:57, David F. Place wrote:
> After reading Daniel Fischer's message about trying to use EnumSet in his Sudoku.hs and finding that it was slower when processing a large data set, I decided to do some profiling. I rewrote his solver to use EnumSets and profiled it. The culprit turns out to be the
The main & evil culprit, methinks now, was DiffArray and the small allocation area. Care to re-profile with my SudokuSet.hs? Unless I overlooked something, I use foldBits only via size (though that's used a lot).
> following function, which is responsible for 22% of the allocation in the run. Is there a more efficient way to write this function?
> foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a
> foldBits _ z 0  = z
> foldBits f z bs = foldBits' f 0 bs z
>
> foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a
> foldBits' f i bs z
>   | bs == 0   = z
>   | otherwise = z' `seq` foldBits' f i' bs' z'
>   where
>     z' | 1 == bs .&. 1 = f z i
           ^^^^^^^^^^^^^^^^^

testBit bs 0? And foldBits(') is only used for c = Word, so why the polymorphism?

>        | otherwise     = z
>     i'  = i + 1
>     bs' = bs `shiftR` 1
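(For concreteness, a monomorphic, testBit-based variant along the lines of that comment might look like the sketch below; the name foldBitsW is illustrative and is not part of EnumSet.)

import Data.Bits (testBit, shiftR)
import Data.Word (Word)

-- Fold over the indices of the set bits of a Word, strict in the
-- accumulator; a single monomorphic loop avoids passing a Bits dictionary.
foldBitsW :: (a -> Int -> a) -> a -> Word -> a
foldBitsW f z0 w0 = go 0 w0 z0
  where
    go i bs z
      | bs == 0   = z
      | otherwise = z' `seq` go (i + 1) (bs `shiftR` 1) z'
      where
        z' | testBit bs 0 = f z i
           | otherwise    = z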
> ps. I was impressed with how hairy DF's algorithm is, and I am not
Now there are comments; I hope they explain what I do.
> really interested enough in Sudoku to spend the time needed to grok it. So I decided to try an experiment to see if I could restructure it without understanding it very deeply.
>
> Step 1. Comment out all the type signatures.
> Step 2. Find the main place that I wanted to change from [Int] to (Set Int).
> Step 3. Compile; make obvious edits; repeat until there are 0 errors.
>
> I had it running in a few minutes. I can't imagine doing that in any other programming environment!
Great! Triple Cheer for Haskell!!! I wonder how different your translation is from mine (I've no experience with bit-twiddling, but I tried to be as cheap as possible).
> Cheers, David
Cheers, Daniel

--
"In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt." -- Blair P. Houghton

Hello Daniel,

Saturday, April 8, 2006, 4:21:14 AM, you wrote:
> Unless I overlooked something, I use foldBits only via size (though that's used a lot).
Size of a set? There is a much faster method - use a table mapping
[0..255] -> number of bits set in that number, seen as a set.
Then we split the Word into bytes and count the total size of the set by adding up the number of bits set in each byte.

foldBits can (maybe) be made faster by adding strictness annotations:

foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a
foldBits _ z bs | z `seq` bs `seq` False = undefined

foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a
foldBits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined

Moreover, GHC doesn't inline recursive functions! So foldBits' is out of luck, and it seems that GHC generates the polymorphic version, which is of course very, very slow. What can you do?

1. Use a SPECIALIZE pragma. This allows a faster version to be made at least for the typical cases (a = c = Int, for example); a sketch of this follows below.

2. Make the recursion go through an internal (locally defined) foldBits' helper; maybe this will let GHC inline, and therefore specialize, each call to foldBits'. It's better to ask Simon Marlow about this.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
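(A minimal sketch combining points 1 and 2 above: the recursion lives in a local helper, and a SPECIALIZE pragma asks GHC for a monomorphic copy. The chosen instantiation, a = Int and c = Word, is only an example.)

import Data.Bits (Bits, (.&.), shiftR)
import Data.Word (Word)

-- The fold from the thread, restructured with a local loop (point 2)
-- and one requested specialization (point 1).
{-# SPECIALIZE foldBits :: (Int -> Int -> Int) -> Int -> Word -> Int #-}
foldBits :: (Bits c, Num c) => (a -> Int -> a) -> a -> c -> a
foldBits f z0 bs0 = go 0 bs0 z0
  where
    go i bs z
      | bs == 0   = z
      | otherwise = z' `seq` go (i + 1) (bs `shiftR` 1) z'
      where
        z' | bs .&. 1 == 1 = f z i
           | otherwise     = z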

On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote:
> Hello Daniel,
>
> Saturday, April 8, 2006, 4:21:14 AM, you wrote:
>
>> Unless I overlooked something, I use foldBits only via size (though that's used a lot).
>
> Size of a set? There is a much faster method - use a table mapping
> [0..255] -> number of bits set in that number, seen as a set.
Or:

bitcount :: Word -> Int
bitcount 0 = 0
bitcount x = 1 + bitcount (x .&. (x-1))

-- | /O(1)/. The number of elements in the set.
size :: Set a -> Int
size (Set w) = bitcount w

Taking a look at the generated core (with -O2), bitcount gets unboxed the way I'd expect, so this might do the trick.
> Then we split the Word into bytes and count the total size of the set by adding up the number of bits set in each byte.
>
> foldBits can (maybe) be made faster by adding strictness annotations:
>
> foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a
> foldBits _ z bs | z `seq` bs `seq` False = undefined
>
> foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a
> foldBits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined
>
> Moreover, GHC doesn't inline recursive functions! So foldBits' is out of luck, and it seems that GHC generates the polymorphic version, which is of course very, very slow. What can you do?
>
> 1. Use a SPECIALIZE pragma. This allows a faster version to be made at least for the typical cases (a = c = Int, for example).
>
> 2. Make the recursion go through an internal (locally defined) foldBits' helper; maybe this will let GHC inline, and therefore specialize, each call to foldBits'. It's better to ask Simon Marlow about this.
>
> -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Thanks Bulat and Robert. I implemented Bulat's idea as the following. It tests faster than Robert's. I used Robert's bitcount to compute the table. The performance seems satisfactory now.

size :: Set a -> Int
size (Set w) = countBits w
  where
    countBits w
      | w == 0    = 0
      | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)

bitsTable :: Array Word Int
bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]

bitcount :: Word -> Int
bitcount 0 = 0
bitcount x = 1 + bitcount (x .&. (x-1))

On Apr 8, 2006, at 1:21 PM, Robert Dockins wrote:
> On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote:
>
>> Hello Daniel,
>>
>> Saturday, April 8, 2006, 4:21:14 AM, you wrote:
>>
>>> Unless I overlooked something, I use foldBits only via size (though that's used a lot).
>>
>> Size of a set? There is a much faster method - use a table mapping
>> [0..255] -> number of bits set in that number, seen as a set.
>
> Or:
>
> bitcount :: Word -> Int
> bitcount 0 = 0
> bitcount x = 1 + bitcount (x .&. (x-1))
>
> -- | /O(1)/. The number of elements in the set.
> size :: Set a -> Int
> size (Set w) = bitcount w
--------------------------------
David F. Place
mailto:d@vidplace.com

Hello David,

Saturday, April 8, 2006, 9:58:56 PM, you wrote:
> bitsTable :: Array Word Int
> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
bitsTable :: UArray Word Int
bitsTable = listArray (0,255) $ map bitcount [0..255]

UArray is much faster than Array but can be used only for simple element types (integral and floating-point types, Char, Bool). More info at http://haskell.org/haskellwiki/Arrays

btw: you can use "UArray a Bool" to represent sets of ANY type `a` belonging to class Ix. It uses only one bit per element in the current implementation.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
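(A small sketch of that "UArray a Bool" idea; the BoolSet, fromList, and member names are illustrative and not from any library.)

import Data.Array.Unboxed (UArray, accumArray, (!))
import Data.Ix (Ix)

-- One bit per possible element: a set over any Ix-indexable type.
type BoolSet a = UArray a Bool

-- Build a set from a bounds pair and a list of members.
fromList :: Ix a => (a, a) -> [a] -> BoolSet a
fromList bnds xs = accumArray (\_ new -> new) False bnds [(x, True) | x <- xs]

-- Membership is a single array lookup.
member :: Ix a => a -> BoolSet a -> Bool
member x s = s ! x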

On Apr 8, 2006, at 1:58 PM, David F. Place wrote:
> Thanks Bulat and Robert. I implemented Bulat's idea as the following. It tests faster than Robert's. I used Robert's bitcount to compute the table. The performance seems satisfactory now.
>
> size :: Set a -> Int
> size (Set w) = countBits w
>   where
>     countBits w
>       | w == 0    = 0
>       | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)
>
> bitsTable :: Array Word Int
> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
>
> bitcount :: Word -> Int
> bitcount 0 = 0
> bitcount x = 1 + bitcount (x .&. (x-1))
There are a couple of other nice bit-twiddly things you can do:

countBits :: Word -> Int
countBits w
  | w == 0    = 0
  | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)

bitsTable :: Array Word Int
bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]

bitcount :: Word -> Int
bitcount 0 = 0
bitcount x = 1 + bitcount (x .&. (x-1))

lsb :: Word -> Int
lsb x = countBits ((x-1) .&. (complement x))

-- stolen from http://aggregate.org/MAGIC/
msb :: Word -> Int
msb x0 =
  let x1 = x0 .|. (x0 `shiftR` 1)
      x2 = x1 .|. (x1 `shiftR` 2)
      x3 = x2 .|. (x2 `shiftR` 4)
      x4 = x3 .|. (x3 `shiftR` 8)
      x5 = x4 .|. (x4 `shiftR` 16)
  in countBits x5 - 1

findMinIndex :: Word -> Int
findMinIndex 0 = error "EnumSet.findMin: empty set has no minimal element"
findMinIndex w = lsb w

findMaxIndex :: Word -> Int
findMaxIndex 0 = error "EnumSet.findMax: empty set has no maximal element"
findMaxIndex w = msb w

This should make all access to the greatest or least element O(1). I guess, come to think of it, all operations on EnumSet are O(1) by virtue of the set size being upper-bounded. At any rate, this turns recursion into unboxable straight-line code, and I think it does fewer allocations.

Rob Dockins

Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Hum, oddly, these actually slow things down. While the new size brought the sudoku17 time from ~570s down to ~490s, the new findMinIndex/findMaxIndex increased the time to ~515s, although they are hardly used. Why?

Cheers, Daniel

On Sunday, 9 April 2006 00:54, Robert Dockins wrote:
> On Apr 8, 2006, at 1:58 PM, David F. Place wrote:
>
>> Thanks Bulat and Robert. I implemented Bulat's idea as the following. It tests faster than Robert's. I used Robert's bitcount to compute the table. The performance seems satisfactory now.
>>
>> size :: Set a -> Int
>> size (Set w) = countBits w
>>   where
>>     countBits w
>>       | w == 0    = 0
>>       | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)
>>
>> bitsTable :: Array Word Int
>> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
>>
>> bitcount :: Word -> Int
>> bitcount 0 = 0
>> bitcount x = 1 + bitcount (x .&. (x-1))
>
> There are a couple of other nice bit-twiddly things you can do:
>
> countBits :: Word -> Int
> countBits w
>   | w == 0    = 0
>   | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)
>
> bitsTable :: Array Word Int
> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
>
> bitcount :: Word -> Int
> bitcount 0 = 0
> bitcount x = 1 + bitcount (x .&. (x-1))
>
> lsb :: Word -> Int
> lsb x = countBits ((x-1) .&. (complement x))
>
> -- stolen from http://aggregate.org/MAGIC/
> msb :: Word -> Int
> msb x0 =
>   let x1 = x0 .|. (x0 `shiftR` 1)
>       x2 = x1 .|. (x1 `shiftR` 2)
>       x3 = x2 .|. (x2 `shiftR` 4)
>       x4 = x3 .|. (x3 `shiftR` 8)
>       x5 = x4 .|. (x4 `shiftR` 16)
>   in countBits x5 - 1
>
> findMinIndex :: Word -> Int
> findMinIndex 0 = error "EnumSet.findMin: empty set has no minimal element"
> findMinIndex w = lsb w
>
> findMaxIndex :: Word -> Int
> findMaxIndex 0 = error "EnumSet.findMax: empty set has no maximal element"
> findMaxIndex w = msb w
>
> This should make all access to the greatest or least element O(1). I guess, come to think of it, all operations on EnumSet are O(1) by virtue of the set size being upper-bounded. At any rate, this turns recursion into unboxable straight-line code, and I think it does fewer allocations.
>
> Rob Dockins
>
> Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG
-- "In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt." -- Blair P. Houghton

On Apr 8, 2006, at 9:30 PM, Daniel Fischer wrote:
> Hum, oddly, these actually slow things down. While the new size brought the sudoku17 time from ~570s down to ~490s, the new findMinIndex/findMaxIndex increased the time to ~515s, although they are hardly used. Why?
Hard to say. I'd expect that if the bit-twiddly parts get turned directly into the corresponding opcodes, it would help (but that's not certain). It's possible that GHC munges things somewhere in the pipeline and accidentally unoptimizes; it's possible that strictness isn't correctly discovered in 'msb'. Or it could be that whatever (probably superscalar) chip you're running does a better job with the loop (although that's hard to believe...), or it's some sort of cache effect... or who knows. You'd probably have to study the Core and/or the disassembly to figure out exactly what happened. I suppose it's possible that the change had some bizarre ripple effect that somehow suppressed a helpful optimization in some other part of the program.

At this point it sort of becomes black magic, and I must confess I'm no magician. It's disappointing that those lovely, inscrutable algorithms don't help, though ;-)

Rob Dockins
> Cheers, Daniel
>
> On Sunday, 9 April 2006 00:54, Robert Dockins wrote:
>
>> On Apr 8, 2006, at 1:58 PM, David F. Place wrote:
>>
>>> Thanks Bulat and Robert. I implemented Bulat's idea as the following. It tests faster than Robert's. I used Robert's bitcount to compute the table. The performance seems satisfactory now.
>>>
>>> size :: Set a -> Int
>>> size (Set w) = countBits w
>>>   where
>>>     countBits w
>>>       | w == 0    = 0
>>>       | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)
>>>
>>> bitsTable :: Array Word Int
>>> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
>>>
>>> bitcount :: Word -> Int
>>> bitcount 0 = 0
>>> bitcount x = 1 + bitcount (x .&. (x-1))
>>
>> There are a couple of other nice bit-twiddly things you can do:
>>
>> countBits :: Word -> Int
>> countBits w
>>   | w == 0    = 0
>>   | otherwise = countBits (w `shiftR` 8) + bitsTable ! (w .&. 0xFF)
>>
>> bitsTable :: Array Word Int
>> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]]
>>
>> bitcount :: Word -> Int
>> bitcount 0 = 0
>> bitcount x = 1 + bitcount (x .&. (x-1))
>>
>> lsb :: Word -> Int
>> lsb x = countBits ((x-1) .&. (complement x))
>>
>> -- stolen from http://aggregate.org/MAGIC/
>> msb :: Word -> Int
>> msb x0 =
>>   let x1 = x0 .|. (x0 `shiftR` 1)
>>       x2 = x1 .|. (x1 `shiftR` 2)
>>       x3 = x2 .|. (x2 `shiftR` 4)
>>       x4 = x3 .|. (x3 `shiftR` 8)
>>       x5 = x4 .|. (x4 `shiftR` 16)
>>   in countBits x5 - 1
>>
>> findMinIndex :: Word -> Int
>> findMinIndex 0 = error "EnumSet.findMin: empty set has no minimal element"
>> findMinIndex w = lsb w
>>
>> findMaxIndex :: Word -> Int
>> findMaxIndex 0 = error "EnumSet.findMax: empty set has no maximal element"
>> findMaxIndex w = msb w
>>
>> This should make all access to the greatest or least element O(1). I guess, come to think of it, all operations on EnumSet are O(1) by virtue of the set size being upper-bounded. At any rate, this turns recursion into unboxable straight-line code, and I think it does fewer allocations.
>>
>> Rob Dockins
>>
>> Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG
>
> --
> "In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt." -- Blair P. Houghton
Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Hello Robert,

Sunday, April 9, 2006, 2:54:58 AM, you wrote:
> findMinIndex :: Word -> Int
> findMaxIndex :: Word -> Int
On the other hand, these procedures can use the same divide-into-bytes technique as `size`:

findMinIndex 0 = undefined
findMinIndex n = case n .&. 255 of
                   0 -> 8 + findMinIndex (n `shiftR` 8)
                   b -> minIndexInByte ! b

Something like this should work.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
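(To make that sketch self-contained: minIndexInByte is the table it assumes; the construction below is just one possible way to build it.)

import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Bits ((.&.), shiftR)
import Data.Word (Word)

-- Index (0..7) of the least significant set bit of each byte value;
-- the entry for 0 is a sentinel that findMinIndex never consults.
minIndexInByte :: UArray Word Int
minIndexInByte = listArray (0, 255) (map lowBit [0 .. 255])
  where
    lowBit :: Word -> Int
    lowBit 0 = 0
    lowBit b | b .&. 1 == 1 = 0
             | otherwise    = 1 + lowBit (b `shiftR` 1)

findMinIndex :: Word -> Int
findMinIndex 0 = undefined   -- as in the sketch above; EnumSet raises an error here
findMinIndex n = case n .&. 255 of
                   0 -> 8 + findMinIndex (n `shiftR` 8)
                   b -> minIndexInByte ! b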

Hi Bulat,

On Apr 9, 2006, at 6:31 AM, Bulat Ziganshin wrote:
> On the other hand, these procedures can use the same divide-into-bytes technique as `size`:
>
> findMinIndex 0 = undefined
> findMinIndex n = case n .&. 255 of
>                    0 -> 8 + findMinIndex (n `shiftR` 8)
>                    b -> minIndexInByte ! b
>
> Something like this should work.
In an email to me, Jean-Philippe Bernardy expressed a concern that a large table could needlessly fill the data cache. He proposed checking 4 bits at a time and using a small table of 16 elements. Not surprisingly, it isn't as fast. I wonder what you think of this. Also, I wonder what would be a good test to demonstrate this possible interaction with the cache.

Cheers, David

ps. Thanks for the tip about UArray.

--------------------------------
David F. Place
mailto:d@vidplace.com
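(For reference, a 4-bits-at-a-time variant along the lines Jean-Philippe proposed might look like the sketch below; the names are illustrative, not his actual code.)

import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Bits ((.&.), shiftR)
import Data.Word (Word)

-- 16-entry table: number of set bits in each nibble value.
nibbleBits :: UArray Word Int
nibbleBits = listArray (0, 15) (map bitcount [0 .. 15])
  where
    bitcount :: Word -> Int
    bitcount 0 = 0
    bitcount x = 1 + bitcount (x .&. (x - 1))

-- Count bits four at a time: a much smaller table (less cache pressure)
-- at the cost of twice as many loop iterations as the byte-wide version.
countBits4 :: Word -> Int
countBits4 0 = 0
countBits4 w = nibbleBits ! (w .&. 0xF) + countBits4 (w `shiftR` 4)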

Hello David,

Sunday, April 9, 2006, 5:47:03 PM, you wrote:
> In an email to me, Jean-Philippe Bernardy expressed a concern that a large table could needlessly fill the data cache. He proposed checking 4 bits at a time and using a small table of 16 elements. Not surprisingly, it isn't as fast. I wonder what you think of this. Also, I wonder what would be a good test to demonstrate this possible interaction with the cache.
It depends on the actual task, the CPU, and the other programs running at the same time :) I can just say that it would be better to use an array of Word8 (with `fromIntegral` for conversion) and to use `unsafeAt` for indexing.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
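(A sketch of that suggestion: a Word8 table indexed with unsafeAt; countBits8 and bitsTable8 are illustrative names, and the masking below is what keeps the unchecked indexing in bounds.)

import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray, listArray)
import Data.Bits ((.&.), shiftR)
import Data.Word (Word, Word8)

-- Byte-wide popcount table stored as Word8 (a quarter of the space of
-- an Int table), indexed without bounds checks via unsafeAt.
bitsTable8 :: UArray Int Word8
bitsTable8 = listArray (0, 255) (map (fromIntegral . bitcount) [0 .. 255])
  where
    bitcount :: Word -> Int
    bitcount 0 = 0
    bitcount x = 1 + bitcount (x .&. (x - 1))

countBits8 :: Word -> Int
countBits8 0 = 0
countBits8 w = fromIntegral (bitsTable8 `unsafeAt` fromIntegral (w .&. 0xFF))
             + countBits8 (w `shiftR` 8)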

On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote:
> foldBits can (maybe) be made faster by adding strictness annotations:
>
> foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a
> foldBits _ z bs | z `seq` bs `seq` False = undefined
>
> foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a
> foldBits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined
Indeed, I had tried this. It is slower for reasons that are mysterious to me.

--------------------------------
David F. Place
mailto:d@vidplace.com
participants (4):
- Bulat Ziganshin
- Daniel Fischer
- David F. Place
- Robert Dockins