Faster IntSet by using BitMaps in the lower branches

Hi, for one of my applications, I am generating huge IntSets (and IntMaps), and I am regularly hitting the memory constraints of my machine. Hence I am looking for a better implementation of IntSets with smaller memory foot print, at least in the case of dense sets, e.g. those where it is likely that a few values are equal in all but the lower 5 or 6 bits. So instead of a simple tree, I re-implemented IntSet as a tree where the nodes contain bit maps of one machine word (32 or 64 entries): Instead of Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +--* | | +--* | | | +-- 1 | | | +-- 2 | | +-- 42 | +-- 100 +--* +-- -3 +-- -1 we have Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +-- 0 0000000000000000000001000000000000000000000000000000000000000110 | +-- 64 0000000000000000000000000001000000000000000000000000000000000000 +-- -64 1010000000000000000000000000000000000000000000000000000000000000 The results are promising. I have attached a "progress" graph, comparing the performance of IntMap with DenseIntMap. As you can see, performance is better for most common operations, greatly better for intersection and union, and comparable for fold/filter/findMax/findMin. And here is the memory usage of an IntSet with one million members and varying distance between the entries: $ cat membench.hs import qualified Data.IntSet as S import qualified Data.DenseIntSet as DS import System.Environment main = do [what, sizeS, stepS] <- getArgs let list = [0,read stepS .. read stepS * read sizeS] if what `elem` ["r","R"] then S.size ( S.fromList list) `seq` return () else DS.size (DS.fromList list) `seq` return () For IntSet, the memory consumption does not change notably with the step size: $ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M $ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M But for DenseIntMap, we see that dense maps are, as desired, much more compact: $ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 2M $ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 3M $ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 18M $ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 77M I have put the code here: https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs It certainly needs a bit more cleanup, e.g. documenting, commenting and maybe improving the test coverage (as, after all, I added quite a bit of logic that can now go wrong). And I did not test the code on a 32 bit system yet. But I’d like to hear from you whether this could in principle go into the containers library (as IntSet, not as DenseIntSet) or if there is a reason not to use this implementation. Thanks, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On Sat, 17 Sep 2011, Joachim Breitner wrote:
for one of my applications, I am generating huge IntSets (and IntMaps), and I am regularly hitting the memory constraints of my machine. Hence I am looking for a better implementation of IntSets with smaller memory foot print, at least in the case of dense sets, e.g. those where it is likely that a few values are equal in all but the lower 5 or 6 bits.
It sounds reasonable for me. I guess in my cases the tree will often only contain a few bit-vector nodes.

Hi Joachim,
for one of my applications, I am generating huge IntSets (and IntMaps), and I am regularly hitting the memory constraints of my machine. Hence I am looking for a better implementation of IntSets with smaller memory foot print, at least in the case of dense sets, e.g. those where it is likely that a few values are equal in all but the lower 5 or 6 bits.
So instead of a simple tree, I re-implemented IntSet as a tree where the nodes contain bit maps of one machine word (32 or 64 entries):
Instead of Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +--* | | +--* | | | +-- 1 | | | +-- 2 | | +-- 42 | +-- 100 +--* +-- -3 +-- -1
we have
Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +-- 0 0000000000000000000001000000000000000000000000000000000000000110 | +-- 64 0000000000000000000000000001000000000000000000000000000000000000 +-- -64 1010000000000000000000000000000000000000000000000000000000000000
The results are promising. I have attached a "progress" graph, comparing the performance of IntMap with DenseIntMap. As you can see, performance is better for most common operations, greatly better for intersection and union, and comparable for fold/filter/findMax/findMin.
And here is the memory usage of an IntSet with one million members and varying distance between the entries:
$ cat membench.hs import qualified Data.IntSet as S import qualified Data.DenseIntSet as DS import System.Environment
main = do [what, sizeS, stepS] <- getArgs let list = [0,read stepS .. read stepS * read sizeS] if what `elem` ["r","R"] then S.size ( S.fromList list) `seq` return () else DS.size (DS.fromList list) `seq` return ()
For IntSet, the memory consumption does not change notably with the step size:
$ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M $ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M
But for DenseIntMap, we see that dense maps are, as desired, much more compact:
$ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 2M $ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 3M $ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 18M $ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 77M
I have put the code here: https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs
It certainly needs a bit more cleanup, e.g. documenting, commenting and maybe improving the test coverage (as, after all, I added quite a bit of logic that can now go wrong). And I did not test the code on a 32 bit system yet.
But I’d like to hear from you whether this could in principle go into the containers library (as IntSet, not as DenseIntSet) or if there is a reason not to use this implementation.
Great job! Yes, this can definitely go into containers as IntSet. It is a neat idea, which vastly decreases memory consumption for dense sets. In case the set is dense many operations are also faster as the tree is of lower depth. The operations which are searching for nonexact value (find minimum, find neighbor of an element, also list elements in ascending order) can go slower, as visible on your findMax. Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in. When you are happy with the code, could you please send a pull request on the github containers repo? I (and/or Johan, if he wants) will look at the code. Of course, having good comments and test coverage would be a great help. I am quite busy till Sun 25, but have some time after that. Cheers, Milan

Hi, Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
Great job!
thanks for the positive feedback.
Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in.
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function: foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1) I’ll try to optimize this function individually now, any suggestions are welcome. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On Sun, 18 Sep 2011, Joachim Breitner wrote:
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
You can certainly do some binary search by masking and comparing with bit patterns like 1 `shiftL` 32 - 1 `shiftL` 16 1 `shiftL` 16 - 1 `shiftL` 0

On Sun, Sep 18, 2011 at 4:13 PM, Henning Thielemann < lemming@henning-thielemann.de> wrote:
On Sun, 18 Sep 2011, Joachim Breitner wrote:
I have attached some benchmarking result. While results are good for
member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
You can certainly do some binary search by masking and comparing with bit patterns like 1 `shiftL` 32 - 1 `shiftL` 16 1 `shiftL` 16 - 1 `shiftL` 0
You can also gain some speed by switching to unsafeShiftRL, to drop the
needless comparison. -Edward

Hi, Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann:
On Sun, 18 Sep 2011, Joachim Breitner wrote:
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
You can certainly do some binary search by masking and comparing with bit patterns like 1 `shiftL` 32 - 1 `shiftL` 16 1 `shiftL` 16 - 1 `shiftL` 0
I’d like to avoid the binary search, as it is more expensive for dense sets. Milan’s suggestion of shifts by 6 might be a good compromise. Another approach might be to first use lowestBitSet to start with the lowest bit. In case of only one bit set, it will not iterate further then. I tried adding some strictness annotation to go and see if that helps. According to the attachement, it does, instead of 4 times slower in the worst case (Size 4 million, step 100) it is only ~2,2x slower. What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean? Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also succ gets properly resolved to GHCziPrim.zpzh. Intersection is slower because the intersection of a Tip with a Bin cannot just be resolved by a single lookup. Might this be related to the following and if yes, what I can do about it? SpecConstr Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]' has four call patterns, but the limit is 3 Use -fspec-constr-count=n to set the bound Use -dppr-debug to see specialisations Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On Sun, 18 Sep 2011, Joachim Breitner wrote:
I’d like to avoid the binary search, as it is more expensive for dense sets. Milan’s suggestion of shifts by 6 might be a good compromise. Another approach might be to first use lowestBitSet to start with the lowest bit. In case of only one bit set, it will not iterate further then.
If lowestBitSet is a fast CPU instruction you can iterate through all bits by clearing the lowest set bit after visit and call lowestBitSet again. However, I am not aware of such a machine instruction on x86.

Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann:
On Sun, 18 Sep 2011, Joachim Breitner wrote:
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
You can certainly do some binary search by masking and comparing with bit patterns like 1 `shiftL` 32 - 1 `shiftL` 16 1 `shiftL` 16 - 1 `shiftL` 0
I’d like to avoid the binary search, as it is more expensive for dense sets. Milan’s suggestion of shifts by 6 might be a good compromise. Another approach might be to first use lowestBitSet to start with the lowest bit. In case of only one bit set, it will not iterate further then.
You could use lowestBitSet iteratively, but I am afraid it will be too slow.
I tried adding some strictness annotation to go and see if that helps. According to the attachement, it does, instead of 4 times slower in the worst case (Size 4 million, step 100) it is only ~2,2x slower.
What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean?
U(L) is the best -- the outer U means "unboxed". The (L) is kinda wrong in this unboxed case -- just consider U(L) as U -- that is the way GHC always dumps unboxed ints.
Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also succ gets properly resolved to GHCziPrim.zpzh.
Shouldn't succ be resolved to +1? If I look correctly, you are incrementing an Int. Aha -- succ is checking, whether the Int is 2^32-1. You should probably use (bi+1) instead of (succ bi).
Intersection is slower because the intersection of a Tip with a Bin cannot just be resolved by a single lookup.
You could improve the two Bin vs Tip cases. When you are in that case, call a function 'lookupTip' which will lookup a corresponding Tip in the larger tree (or returns Nil) and then do the .&.. Currently you are doing unnecessary pattern matchings each time you call intersection. (This is actually how the Data.IntSet.intersection works -- the Bin vs Tip case calls member, which patter matches only the bigger tree.) When you mimic the Data.IntSet.intersection more closely, you should get nearly same complexity. Cheers, Milan
Might this be related to the following and if yes, what I can do about it?
SpecConstr Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]' has four call patterns, but the limit is 3 Use -fspec-constr-count=n to set the bound Use -dppr-debug to see specialisations
Greetings, Joachim
-- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Hi Milan, Am Sonntag, den 18.09.2011, 23:06 +0200 schrieb Milan Straka:
Intersection is slower because the intersection of a Tip with a Bin cannot just be resolved by a single lookup.
You could improve the two Bin vs Tip cases. When you are in that case, call a function 'lookupTip' which will lookup a corresponding Tip in the larger tree (or returns Nil) and then do the .&.. Currently you are doing unnecessary pattern matchings each time you call intersection. (This is actually how the Data.IntSet.intersection works -- the Bin vs Tip case calls member, which patter matches only the bigger tree.)
When you mimic the Data.IntSet.intersection more closely, you should get nearly same complexity.
Changing the number of specializations did not help, so I implemented the loookupTip (here called intersectBM, for consistency with similar helpers). If you compare comparison-3-spec-patterns.pdf (the base case) with comparison-intersectBM.pdf, it shows that it is even slower than my code, as follows: -- The intersection of one tip with a map intersectBM :: Prefix -> BitMap -> IntSet -> IntSet intersectBM kx bm (Bin p2 m2 l2 r2) | nomatch kx p2 m2 = Nil | zero kx m2 = intersectBM kx bm l2 | otherwise = intersectBM kx bm r2 intersectBM kx bm (Tip kx' bm') | kx == kx' = tip kx (bm .&. bm') | otherwise = Nil intersectBM kx bm Nil = Nil Inlining did not help much. Floating the constant parameters and having an explicit "go" function die not change anything, I guess I can rely on ghc to do that for me. Adding strictness helps a bit, so that for up to medium sized maps performance is equal, even for sparse maps. Not sure what goes wrong for large sets. BTW, note the great performance boost for very dense maps when it comes to intersection. It’s barely readable any more :-) I have attached the full statistics for the current version (comparison.pdf). The solid line is IntMap, the long dashed line is DenseIntMap and the short dashed line (left axis) is the speed-up in percent, so lower is better. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

Hi, Am Sonntag, den 18.09.2011, 22:47 +0200 schrieb Joachim Breitner:
I’d like to avoid the binary search, as it is more expensive for dense sets. Milan’s suggestion of shifts by 6 might be a good compromise. Another approach might be to first use lowestBitSet to start with the lowest bit. In case of only one bit set, it will not iterate further then.
I tried adding some strictness annotation to go and see if that helps. According to the attachement, it does, instead of 4 times slower in the worst case (Size 4 million, step 100) it is only ~2,2x slower.
first shifting by lowestBitSet seems to help noticably, at least in my benchmarks, see attachment. Slowdown in the worst case is only 1.25 by now. That is enough for today, the benchmarks take too long and it is late :-) lowestSetBit could maybe be optimized further. There seems to be an operation for that (bsr and bsf), it remains to be checked if this is sufficiently fast. If it is indeed, then this should indeed by done inside the loop, and not just one, as Henning suggests. Can library code easily call such primitive ops directly or would that require support in GHC? Changing succ bi to bi + 1 (not included in attached benchmark yet) seems to have a minor benefit on a pure foldrBits benchmark and a minor degression on a pure toList benchmark (see plot.png) Not sure what that means, probably nothing.
You could improve the two Bin vs Tip cases. When you are in that case, call a function 'lookupTip'...
I’ll do the intersection optimization, although I wonder if ghc would not do that for me if it were allowed to introduce enough call patterns? Good night for now Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
Great job!
thanks for the positive feedback.
Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in.
I have attached some benchmarking result.
Thank you very much. I am quite surprised by the slowdown in the intersection 100 case -- I would expect the times to be nearly identical for the DenseIntSet and IntSet, as seen in the union case. Any ideas?
While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
We have to be sure there is no unneeded slowdown because of inefficient Core code -- that all numbers are unpacked, succ is unpacked to primop, a primop is used for the shift, and so on. Also, you can use some additional bit tricks -- for example the following heuristic, which skips last 6 bits if they are all zeroes:
go bi 0 = x {- heuristic -} go bi n | n .&. 0x3f == 0 = go (bi + 6) (n `shiftRL` 6) go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
Of course, the 6 used can be changed to some other number - some measurement should be made. I expected a square root of the number of bits to behave reasonably well and 6 seemed to be a good compromise for 32 and 64. Cheers, Milan

There are some neat tricks you can use using deBruijn multiplication to
optimize finding the least significant set bit.
My geometric coalgebra code in
https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs...
the following:
lsb :: *Word64* -> Int
lsb n = fromIntegral $ ix ! shiftR ((n .&. (-n)) * 0x07EDD5E59A4E28C2) 58
where
-- a 64 bit deBruijn multiplication table
ix :: UArray Word64 Word8
ix = listArray (0, 63)
[ 63, 0, 58, 1, 59, 47, 53, 2
, 60, 39, 48, 27, 54, 33, 42, 3
, 61, 51, 37, 40, 49, 18, 28, 20
, 55, 30, 34, 11, 43, 14, 22, 4
, 62, 57, 46, 52, 38, 26, 32, 41
, 50, 36, 17, 19, 29, 10, 13, 21
, 56, 45, 25, 31, 35, 16, 9, 12
, 44, 24, 15, 8, 23, 7, 6, 5
]
which could be sped up slightly by using a byteArray# directly for the
backing store to find the least significant set bit in a Word64.
-Edward
On Sun, Sep 18, 2011 at 4:01 PM, Joachim Breitner
Hi,
Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
Great job!
thanks for the positive feedback.
Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in.
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
Greetings, Joachim
-- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

I should have mentioned. This function asumes that the input n is not equal
to 0.
You can use it or a nicely ByteArray#'d and uncheckedShiftRL64#'d version
together with the iterated masking of bits to walk through the bits in
order.
-Edward
On Sun, Sep 18, 2011 at 5:07 PM, Edward Kmett
There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit.
My geometric coalgebra code in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs... the following:
lsb :: *Word64* -> Int lsb n = fromIntegral $ ix ! shiftR ((n .&. (-n)) * 0x07EDD5E59A4E28C2) 58 where -- a 64 bit deBruijn multiplication table ix :: UArray Word64 Word8 ix = listArray (0, 63) [ 63, 0, 58, 1, 59, 47, 53, 2 , 60, 39, 48, 27, 54, 33, 42, 3 , 61, 51, 37, 40, 49, 18, 28, 20 , 55, 30, 34, 11, 43, 14, 22, 4 , 62, 57, 46, 52, 38, 26, 32, 41 , 50, 36, 17, 19, 29, 10, 13, 21 , 56, 45, 25, 31, 35, 16, 9, 12 , 44, 24, 15, 8, 23, 7, 6, 5 ]
which could be sped up slightly by using a byteArray# directly for the backing store to find the least significant set bit in a Word64.
-Edward
On Sun, Sep 18, 2011 at 4:01 PM, Joachim Breitner < mail@joachim-breitner.de> wrote:
Hi,
Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
Great job!
thanks for the positive feedback.
Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in.
I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function:
foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1)
I’ll try to optimize this function individually now, any suggestions are welcome.
Greetings, Joachim
-- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Hi, Am Sonntag, den 18.09.2011, 17:11 -0400 schrieb Edward Kmett:
I should have mentioned. This function asumes that the input n is not equal to 0.
that is fine, that is in invariant that holds for my bit array. But I guess we’d need to use CPP magic to separate bitness – my code uses whatever "Word" means. Or does it work gracefully on 32? How about adding it to Data.Word (then I don’t feel responsible for getting it fast and can just use it :-)) Thanks, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On Sun, Sep 18, 2011 at 5:20 PM, Joachim Breitner
Hi,
Am Sonntag, den 18.09.2011, 17:11 -0400 schrieb Edward Kmett:
I should have mentioned. This function asumes that the input n is not equal to 0.
that is fine, that is in invariant that holds for my bit array.
But I guess we’d need to use CPP magic to separate bitness – my code uses whatever "Word" means. Or does it work gracefully on 32?
How about adding it to Data.Word (then I don’t feel responsible for getting it fast and can just use it :-))
You can use a smaller DeBruijn multiplication table to handle 32 bits, but the one that I gave puts the answer in the most significant 6 bits of a 64 bit word, so if you use it with a 32 bit word, they'll be setting bits you don't have and all answers will be 0. Also, if you don't care about the particular order in which you visit the bits you could store the bits in a different order in the Word(64) than you would if you just set them directly by preshuffling them. I was able to derive the following operations, which move the repeated lookups out of the fold: magic = 0x07EDD5E59A4E28C2 offset = 58 preshuffleTable :: UArray Int Word8 preshuffleTable = listArray (0,63) [63,0,58,1,59,47,53,2,60,39,48,27,54,33,42,3,61,51,37,40,49,18,28,20,55,30,34,11,43,14,22,4, 62,57,46,52,38,26,32,41,50,36,17,19,29,10,13,21,56,45,25,31,35,16,9,12,44,24,15,8,23,7,6,5] preshuffle :: Int -> Int preshuffle n = fromIntegral (preshuffleTable ! n) setShuffled :: Word64 -> Int -> Word64 setShuffled w n = setBit w (preshuffle n) testShuffled :: Word64 -> Int -> Bool testShuffled w n = testBit w (preshuffle n) clearShuffled :: Word64 -> Int -> Word64 clearShuffled w n = clearBit w (preshuffle n) -- NB: the bits do not come out in order! foldShuffledBits :: (Int -> a -> a) -> a -> Word64 -> a foldShuffledBits f z 0 = z foldShuffledBits f z n | t <- n .&. negate n = f (fromIntegral (shiftR (t * magic) offset)) (foldShuffledBits f z (xor n t)) unshuffleTable :: UArray Int Word8 unshuffleTable = listArray (0,63) [1,3,7,15,31,63,62,61,59,54,45,27,55,46,29,58,53,42,21,43,23,47,30,60,57,50,37,11,22,44,25, 51,38,13,26,52,41,18,36,9,19,39,14,28,56,49,34,5,10,20,40,17,35,6,12,24,48,33,2,4,8,16,32,0] I can derive a similar magic, offset and preShuffle table could be derived for 32 bits, I just haven't needed it. The caveat is that you'd get the bits out in the wrong order for toAscList, but if you are concerned with the performance of foldBits it may be a win in other areas and you can loop and test shuffled bits for when you really do need the values in ascending order. -Edward

Probably the simplest version is magic = 0x07EDD5E59A4E28C2 offset = 58 lsbTable :: UArray Word64 Word8 lsbTable = listArray (0, 63) [ 63, 0, 58, 1, 59, 47, 53, 2 , 60, 39, 48, 27, 54, 33, 42, 3 , 61, 51, 37, 40, 49, 18, 28, 20 , 55, 30, 34, 11, 43, 14, 22, 4 , 62, 57, 46, 52, 38, 26, 32, 41 , 50, 36, 17, 19, 29, 10, 13, 21 , 56, 45, 25, 31, 35, 16, 9, 12 , 44, 24, 15, 8, 23, 7, 6, 5 ] foldrBits :: (Int -> a -> a) -> a -> Word64 -> a foldrBits f z 0 = z foldrBits f z n | t <- n .&. negate n = f (fromIntegral (lsbTable ! shiftR (t * magic) offset)) (foldrBits f z (xor n t)) swapping to lsbTable = listArray (0,31) [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9] magic = 0x077CB531 offset = 27 for the 32 bit case. -Edward

Hi, Am Sonntag, den 18.09.2011, 18:46 -0400 schrieb Edward Kmett:
You can use a smaller DeBruijn multiplication table to handle 32 bits, but the one that I gave puts the answer in the most significant 6 bits of a 64 bit word, so if you use it with a 32 bit word, they'll be setting bits you don't have and all answers will be 0.
I’m not sure how to cleanly select the right implementation inside Data.IntMap. Also, we have architectures with 31 bit words (s390, at least). If anything, then the functions ought to move into the Bits type class and into GHC.Word. There it makes also more sense to select architecture-specific implementations, e.g. the primitive operation on x86.
Also, if you don't care about the particular order in which you visit the bits you could store the bits in a different order in the Word(64) than you would if you just set them directly by preshuffling them.
The order is unfortunately relevant, and I am not sure I have a use for an arbitrarily ordered fold. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

The usual mechanism would be to just bundle a pair of foreign functions and
a fallback purely functional implementation.
I'll see about packaging up a Data.Bits.DeBruijn module somewhere with few
if any dependencies.
-Edward
On Mon, Sep 19, 2011 at 8:25 AM, Joachim Breitner
Hi,
Am Sonntag, den 18.09.2011, 18:46 -0400 schrieb Edward Kmett:
You can use a smaller DeBruijn multiplication table to handle 32 bits, but the one that I gave puts the answer in the most significant 6 bits of a 64 bit word, so if you use it with a 32 bit word, they'll be setting bits you don't have and all answers will be 0.
I’m not sure how to cleanly select the right implementation inside Data.IntMap. Also, we have architectures with 31 bit words (s390, at least). If anything, then the functions ought to move into the Bits type class and into GHC.Word. There it makes also more sense to select architecture-specific implementations, e.g. the primitive operation on x86.
Also, if you don't care about the particular order in which you visit the bits you could store the bits in a different order in the Word(64) than you would if you just set them directly by preshuffling them.
The order is unfortunately relevant, and I am not sure I have a use for an arbitrarily ordered fold.
Greetings, Joachim
-- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett
There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit. My geometric coalgebra code in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs... uses the following:
I use this in hashtables also :)
https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/...
--
Gregory Collins

Hi, Am Montag, den 19.09.2011, 16:03 +0200 schrieb Gregory Collins:
On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett
wrote: There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit. My geometric coalgebra code in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs... uses the following:
I use this in hashtables also :)
https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/...
hmm... -- only works with 32-bit values -- ok for us here firstBitSet# :: Int# -> Int# That’s why I feel uneasy about this method. But now we have already three users of a firstBitSet function, all of which had to re-implement the wheel. This is really a strong argument for putting this as a method into Data.Bits and the code into GHC.Word, where machine-specific instructions can be used if found to be faster. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

Hi, Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
When you are happy with the code, could you please send a pull request on the github containers repo? I (and/or Johan, if he wants) will look at the code. Of course, having good comments and test coverage would be a great help. I am quite busy till Sun 25, but have some time after that.
I shaped the code into something that I am happy with so far. I have greatly extended the test suite, so that it covers every non-trivial non-debugging function, and indeed I found and fixed some bugs in split. https://github.com/haskell/containers/pull/3 The git commit history might not be as clean as it could be using rebase -i, but I think it is still worth having it in the repo as it allows one to later jump to, e.g. 4f4ca91, where both IntSet and DenseIntSet and my benchmarking scrips are available. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/
participants (5)
-
Edward Kmett
-
Gregory Collins
-
Henning Thielemann
-
Joachim Breitner
-
Milan Straka