Mixed boxed/unboxed arrays?

J. Reinders

2 Aug 2022 2 Aug '22

1:31 p.m.

Hi GHC devs, I’ve been investigating fast hash table implementations. In particular hash tables used for counting unique items. For this use case, I believe the most performant hash tables are, in C terms, arrays of structures with a (boxed) pointer to the key, which is the item that we are counting, and an (unboxed) integer which holds the actual count. I already know of the ‘vector-hashtables’ package which uses two separate arrays, for example one boxed to hold the keys and one unboxed to hold the counts. However, I believe it can be quite important to store all the elements in the same array as that can reduce the number of cache misses. Because with random access to two arrays there is a higher chance that there will be two cache misses even if it immediately finds the right key in the hash table. So, I have also been looking at the low level arrays from the ‘primitive’ package and even in GHC.Exts, but I don’t believe it is currently possible to create a single array that contains both boxed and unboxed elements. Have I overlooked something? Or else, would it be possible to support this use case in a future version of GHC? Cheers, Jaro

Show replies by date

Viktor Dukhovni

2 Aug 2 Aug

3:06 p.m.

On Tue, Aug 02, 2022 at 03:31:57PM +0200, J. Reinders wrote:

...

I’ve been investigating fast hash table implementations. In particular hash tables used for counting unique items. For this use case, I believe the most performant hash tables are, in C terms, arrays of structures with a (boxed) pointer to the key, which is the item that we are counting, and an (unboxed) integer which holds the actual count.

I already know of the ‘vector-hashtables’ package which uses two separate arrays, for example one boxed to hold the keys and one unboxed to hold the counts. However, I believe it can be quite important to store all the elements in the same array as that can reduce the number of cache misses. Because with random access to two arrays there is a higher chance that there will be two cache misses even if it immediately finds the right key in the hash table.

Could you use `StablePtr` for the keys? https://downloads.haskell.org/~ghc/latest/docs/html/libraries/base-4.16.1.0/... The corresponding `Ptr` can be stored in an unboxed Storable array along with the count. This comes at the cost of later having to explicitly free each StablePtr. https://downloads.haskell.org/~ghc/latest/docs/html/libraries/base-4.16.1.0/... How does the cost of computing object hashes and comparing colliding objects compare with the potential cache miss cost of using boxed integers or a separate array? Would such an "optimisation" be worth the effort? -- Viktor.

J. Reinders

3:32 p.m.

...

Could you use `StablePtr` for the keys?

That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

...

How does the cost of computing object hashes and comparing colliding objects compare with the potential cache miss cost of using boxed integers or a separate array? Would such an "optimisation" be worth the effort?

Literature on hash tables suggests that cache misses were a very important factor in running time (in 2001): https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189 I don’t know whether it has become less or more important now, but I have been told there haven’t been that many advances in memory latency.

Andreas Klebinger

6:24 p.m.

I think it's possible to do this *today* using unsafeCoerce#. I was able to come up with this basic example below. In practice one would at the very least want to abstract away the gnarly stuff inside a library. But since it sounds like you want to be the one to write a library that shouldn't be a problem. {-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} {-# LANGUAGE UnliftedDatatypes #-} moduleMainwhere importGHC.Exts importGHC.IO importUnsafe.Coerce importData.Kind dataSA= SA (SmallMutableArray# RealWorldAny) mkArray:: Int-> a-> IO(SA) mkArray (I# n) initial = IO $ \s -> caseunsafeCoerce# (newSmallArray# n initial s) of (# s', arr #) -> (# s', SA arr #) readLifted:: SA-> Int-> IOa readLifted (SA arr) (I# i) = IO (\s -> unsafeCoerce# (readSmallArray# arr i s) ) dataUWrap(a:: UnliftedType) = UWrap a -- UWrap is just here because we can't return unlifted types in IO -- If you don't need your result in IO you can eliminate this indirection. readUnlifted:: foralla. SA-> Int-> IO(UWrapa) readUnlifted (SA arr) (I# i) = IO (\s -> caseunsafeCoerce# (readSmallArray# arr i s) of (# s', a :: a#) -> (# s', UWrap a #) ) writeLifted:: a-> Int-> SA-> IO() writeLifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# (unsafeCoerce# arr) i x s of s -> (# s, ()#) writeUnlifted:: (a:: UnliftedType) -> Int-> SA-> IO() writeUnlifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# arr i (unsafeCoerce# x) s of s -> (# s, ()#) typeUB:: UnliftedType dataUB= UT | UF showU:: UWrapUB-> String showU (UWrap UT) = "UT" showU (UWrap UF) = "UF" main:: IO() main = do arr <- mkArray 4() writeLifted True 0arr writeLifted False 1arr writeUnlifted UT 2arr writeUnlifted UT 3arr (readLifted arr 0:: IOBool) >>= print (readLifted arr 1:: IOBool) >>= print (readUnlifted arr 2:: IO(UWrapUB)) >>= (putStrLn . showU) (readUnlifted arr 3:: IO(UWrapUB)) >>= (putStrLn . showU) return () Cheers Andreas Am 02/08/2022 um 17:32 schrieb J. Reinders:

...

...
Could you use `StablePtr` for the keys? That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

...
How does the cost of computing object hashes and comparing colliding objects compare with the potential cache miss cost of using boxed integers or a separate array? Would such an "optimisation" be worth the effort? Literature on hash tables suggests that cache misses were a very important factor in running time (in 2001):https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189

I don’t know whether it has become less or more important now, but I have been told there haven’t been that many advances in memory latency. _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Jaro Reinders

6:51 p.m.

It seems you have misunderstood me. I want to store *unboxed* Int#s inside the array, not just some unlifted types. Surely in the case of unboxed integers the unsafeCoerce# function can make the garbage collector crash as they might be interpreted as arbitrary pointers. Cheers, Jaro On 02/08/2022 20:24, Andreas Klebinger wrote:

...

I think it's possible to do this *today* using unsafeCoerce#.

I was able to come up with this basic example below. In practice one would at the very least want to abstract away the gnarly stuff inside a library. But since it sounds like you want to be the one to write a library that shouldn't be a problem.

{-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} {-# LANGUAGE UnliftedDatatypes #-} moduleMainwhere importGHC.Exts importGHC.IO importUnsafe.Coerce importData.Kind dataSA= SA (SmallMutableArray# RealWorldAny) mkArray:: Int-> a-> IO(SA) mkArray (I# n) initial = IO $ \s -> caseunsafeCoerce# (newSmallArray# n initial s) of (# s', arr #) -> (# s', SA arr #) readLifted:: SA-> Int-> IOa readLifted (SA arr) (I# i) = IO (\s -> unsafeCoerce# (readSmallArray# arr i s) ) dataUWrap(a:: UnliftedType) = UWrap a -- UWrap is just here because we can't return unlifted types in IO -- If you don't need your result in IO you can eliminate this indirection. readUnlifted:: foralla. SA-> Int-> IO(UWrapa) readUnlifted (SA arr) (I# i) = IO (\s -> caseunsafeCoerce# (readSmallArray# arr i s) of (# s', a :: a#) -> (# s', UWrap a #) ) writeLifted:: a-> Int-> SA-> IO() writeLifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# (unsafeCoerce# arr) i x s of s -> (# s, ()#) writeUnlifted:: (a:: UnliftedType) -> Int-> SA-> IO() writeUnlifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# arr i (unsafeCoerce# x) s of s -> (# s, ()#) typeUB:: UnliftedType dataUB= UT | UF showU:: UWrapUB-> String showU (UWrap UT) = "UT" showU (UWrap UF) = "UF" main:: IO() main = do arr <- mkArray 4() writeLifted True 0arr writeLifted False 1arr writeUnlifted UT 2arr writeUnlifted UT 3arr (readLifted arr 0:: IOBool) >>= print (readLifted arr 1:: IOBool) >>= print (readUnlifted arr 2:: IO(UWrapUB)) >>= (putStrLn . showU) (readUnlifted arr 3:: IO(UWrapUB)) >>= (putStrLn . showU) return ()

Cheers

Andreas

Am 02/08/2022 um 17:32 schrieb J. Reinders:

...
...
Could you use `StablePtr` for the keys? That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

...
How does the cost of computing object hashes and comparing colliding objects compare with the potential cache miss cost of using boxed integers or a separate array? Would such an "optimisation" be worth the effort? Literature on hash tables suggests that cache misses were a very important factor in running time (in 2001):https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189

I don’t know whether it has become less or more important now, but I have been told there haven’t been that many advances in memory latency. _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Andreas Klebinger

7:25 p.m.

Indeed I misunderstood. As you already suspected this wouldn't work for Int# (or other unboxed types) sadly as the GC would assume these to be pointers which no doubt would lead to segfaults or worse. Rereading your initial mail I can say the runtime currently doesn't support such a heap object. If I understand you correctly what you would like is basically a something like: Con n P I# P I# P I# ... \/ \/\/ Pair1 Pair2 Pair3 ... Where n gives the number of pairs. I can see how it might be feasible to add a heap object like this to GHC but I'm unsure if it would be worth the complexity as it's layout diverges quite a bit from what GHC usually expects. The other option would be to expose to users a way to have an object that consist of a given number of words and a bitmap which indicates to the GHC which fields are pointers. This is more or less the representation that's already used to deal with stack frames iirc so that might not be as far fetched as it seems at first. It might even be possible to implement some sort of prototype for this using hand written Cmm. But there are not any plans to implement anything like this as far as I know. Am 02/08/2022 um 20:51 schrieb Jaro Reinders:

...

It seems you have misunderstood me. I want to store *unboxed* Int#s inside the array, not just some unlifted types. Surely in the case of unboxed integers the unsafeCoerce# function can make the garbage collector crash as they might be interpreted as arbitrary pointers.

Cheers,

Jaro

On 02/08/2022 20:24, Andreas Klebinger wrote:

...
I think it's possible to do this *today* using unsafeCoerce#.

I was able to come up with this basic example below. In practice one would at the very least want to abstract away the gnarly stuff inside a library. But since it sounds like you want to be the one to write a library that shouldn't be a problem.

{-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} {-# LANGUAGE UnliftedDatatypes #-} moduleMainwhere importGHC.Exts importGHC.IO importUnsafe.Coerce importData.Kind dataSA= SA (SmallMutableArray# RealWorldAny) mkArray:: Int-> a-> IO(SA) mkArray (I# n) initial = IO $ \s -> caseunsafeCoerce# (newSmallArray# n initial s) of (# s', arr #) -> (# s', SA arr #) readLifted:: SA-> Int-> IOa readLifted (SA arr) (I# i) = IO (\s -> unsafeCoerce# (readSmallArray# arr i s) ) dataUWrap(a:: UnliftedType) = UWrap a -- UWrap is just here because we can't return unlifted types in IO -- If you don't need your result in IO you can eliminate this indirection. readUnlifted:: foralla. SA-> Int-> IO(UWrapa) readUnlifted (SA arr) (I# i) = IO (\s -> caseunsafeCoerce# (readSmallArray# arr i s) of (# s', a :: a#) -> (# s', UWrap a #) ) writeLifted:: a-> Int-> SA-> IO() writeLifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# (unsafeCoerce# arr) i x s of s -> (# s, ()#) writeUnlifted:: (a:: UnliftedType) -> Int-> SA-> IO() writeUnlifted x (I# i) (SA arr) = IO $ \s -> casewriteSmallArray# arr i (unsafeCoerce# x) s of s -> (# s, ()#) typeUB:: UnliftedType dataUB= UT | UF showU:: UWrapUB-> String showU (UWrap UT) = "UT" showU (UWrap UF) = "UF" main:: IO() main = do arr <- mkArray 4() writeLifted True 0arr writeLifted False 1arr writeUnlifted UT 2arr writeUnlifted UT 3arr (readLifted arr 0:: IOBool) >>= print (readLifted arr 1:: IOBool) >>= print (readUnlifted arr 2:: IO(UWrapUB)) >>= (putStrLn . showU) (readUnlifted arr 3:: IO(UWrapUB)) >>= (putStrLn . showU) return ()

Cheers

Andreas

Am 02/08/2022 um 17:32 schrieb J. Reinders:

...
...
Could you use `StablePtr` for the keys? That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

...
How does the cost of computing object hashes and comparing colliding objects compare with the potential cache miss cost of using boxed integers or a separate array? Would such an "optimisation" be worth the effort? Literature on hash tables suggests that cache misses were a very important factor in running time (in 2001):https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189

I don’t know whether it has become less or more important now, but I have been told there haven’t been that many advances in memory latency. _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Viktor Dukhovni

3 Aug 3 Aug

3:30 a.m.

On Tue, Aug 02, 2022 at 05:32:58PM +0200, J. Reinders wrote:

...

...
Could you use `StablePtr` for the keys?

That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

If your hash table keys qualify for being stored in a "compact region", you may not need per-key stable pointers, just (carefully) coercing the keys to pointers suffices to produce primitive "handles" that are stable for the lifetime of the "compact region". The inverse (unsafe) coercion recovers the key. This also has the advantage that a key count does not incur a high ongoing GC cost. The keys are of course copied into the compact region. With this you could store "pointer + count" in a primitive cell. The hash table then holds a reference to the compact region and compacts keys on insert. https://hackage.haskell.org/package/compact-0.2.0.0/docs/Data-Compact.html -- Viktor.

J. Reinders

8:16 p.m.

Thanks for your suggestion. That sounds like a promising technique. I have an implementation that mostly works here: https://github.com/noughtmare/clutter in the src/Counter.hs file. The only problem is that I get segfaults or internal GHC errors if I run it on large files. I’ve adding some tracing and it seems to occur when I try to coerce back pointers from the hash table array to proper Haskell values in the ’toList’ function. I can reproduce the problem by running ‘bash test.sh’. Currently, I’m using the ‘ptrToAny' and ‘anyToPtr' functions to do the coercing, because that sounds like the safest option. Do you know what’s going wrong or do you have a safer design for coercing the pointers? I thought it might be because the compact region gets deallocated before all the pointers are extracted, but even if I add a ’touch c’ (where c contains the compact region) at the end it still gives the same errors.

...

On 3 Aug 2022, at 05:30, Viktor Dukhovni wrote:

On Tue, Aug 02, 2022 at 05:32:58PM +0200, J. Reinders wrote:

...
...
Could you use `StablePtr` for the keys?

That might be an option, but I have no idea how performant stable pointers are and manual management is obviously not ideal.

If your hash table keys qualify for being stored in a "compact region", you may not need per-key stable pointers, just (carefully) coercing the keys to pointers suffices to produce primitive "handles" that are stable for the lifetime of the "compact region". The inverse (unsafe) coercion recovers the key.

This also has the advantage that a key count does not incur a high ongoing GC cost. The keys are of course copied into the compact region.

With this you could store "pointer + count" in a primitive cell. The hash table then holds a reference to the compact region and compacts keys on insert.

https://hackage.haskell.org/package/compact-0.2.0.0/docs/Data-Compact.html

-- Viktor. _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

J. Reinders

8:35 p.m.

I found the mistake: compactAdd c k p <- anyToPtr k Should be: p <- anyToPtr . getCompact =<< compactAdd c k Otherwise I guess I’m not using the pointer that’s on the compact region.

Viktor Dukhovni

8:51 p.m.

On Wed, Aug 03, 2022 at 10:35:43PM +0200, J. Reinders wrote:

...

I found the mistake:

compactAdd c k p <- anyToPtr k

Should be:

p <- anyToPtr . getCompact =<< compactAdd c k

Otherwise I guess I’m not using the pointer that’s on the compact region.

Correct, I started my reply to your previous message before seeing that you also found the same error. -- Viktor.

Viktor Dukhovni

8:50 p.m.

On Wed, Aug 03, 2022 at 10:16:50PM +0200, J. Reinders wrote:

...

I have an implementation that mostly works here: https://github.com/noughtmare/clutter in the src/Counter.hs file.

The only problem is that I get segfaults or internal GHC errors if I run it on large files. I’ve adding some tracing and it seems to occur when I try to coerce back pointers from the hash table array to proper Haskell values in the ’toList’ function.

Yes, this is delicate, requiring detailed knowledge of the internals.

...

Currently, I’m using the ‘ptrToAny' and ‘anyToPtr' functions to do the coercing, because that sounds like the safest option.

Do you know what’s going wrong or do you have a safer design for coercing the pointers?

The code at: https://github.com/noughtmare/clutter/blob/main/src/Counter.hs#L50-L52 looks wrong. You're ignoring the return value of `compactAdd`, and coercing the original (non-compact) key to a pointer, but this is liable to be moved by GC. You need something like: p <- addCompact c k >>= getCompact >>= anyToPtr

...

I thought it might be because the compact region gets deallocated before all the pointers are extracted, but even if I add a ’touch c’ (where c contains the compact region) at the end it still gives the same errors.

Given the issue above, it is too early to speculate along these lines. It may also turn out that once the code works, it may be no faster or even much slower than the two-array approach. Compacting new keys has a cost, and perhaps that will dominate any speedup from combining the key and value in the same primitive cell. -- Viktor.

1076

Age (days ago)

1077

Last active (days ago)

List overview

Download

10 comments

4 participants

participants (4)

Andreas Klebinger
J. Reinders
Jaro Reinders
Viktor Dukhovni

Mixed boxed/unboxed arrays?

tags

participants (4)