ST Vector / STRef -- Performances and allocation

Guillaume Bouchard

18 Jun 2016 18 Jun '16

8:37 p.m.

Hello. I read a reddit post [0] which ends in a debate about haskell performances. Someone gives a C implementation which runs in 15ms on some dataset. I wanted to learn about mutable vector and ST and got something in haskell which runs in 50ms. Code is here [1]. [0] https://www.reddit.com/r/haskell/comments/4ojsn0/counting_inversions_haskell... [1] https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs When I'm profiling the code, I'm surprised to see more than 50% of the %alloc in function which, from my point of view, must not allocate. For example, this function : inc :: forall s. STRef s Int -> ST s () inc v = modifySTRef' v (+1) Is responsible for 10% of the allocation in this program. I was hoping that the "unboxing magic" of GHC may have replaced my Int somewhere by an unboxed one, but I guess the allocation means that the STRef is somehow pointing to a heap allocated Int. Do you know a way to remove theses allocations? As a second question, if you see any other way to improve this implementation, I'll be happy to learn something new ;) Finally, if you read my code, you'll see that I implemented a small DSL around ST / STRef / MVector to make my main function more readable. How do people write readable code in real life ? Especially, I was annoyed when I tried to convert this piece of C code : if (x1 < mid && (x2 == sz || tmp[x1] <= p[x2])) { a; } else { b; } In my haskell code, x1, x2 are STRef, and tmp and p are MVector. This was highly painful to write in haskell and I had to write something such as: x1 <- readSTRef x1' x2 <- readSTRef x2' cond <- if x1 < mid then if x2 == sz then return True else do tmp_x1 <- Vector.read tmp x1 p_x2 <_ Vector.read p x2 return (tmp_x1 <= p_x2) else return False if cond then a else b This is painful, complex and error prone, so is there another solution ? Thank you.

Show replies by date

Carter Schonwald

18 Jun 18 Jun

10:57 p.m.

StRef is a boxed heap value. If you use a single element Unboxed Mutable vector you may find there's less boxing :) On Saturday, June 18, 2016, Guillaume Bouchard < guillaum.bouchard+haskell@gmail.com> wrote:

...

Hello.

I read a reddit post [0] which ends in a debate about haskell performances. Someone gives a C implementation which runs in 15ms on some dataset. I wanted to learn about mutable vector and ST and got something in haskell which runs in 50ms. Code is here [1].

[0] https://www.reddit.com/r/haskell/comments/4ojsn0/counting_inversions_haskell... [1] https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs

When I'm profiling the code, I'm surprised to see more than 50% of the %alloc in function which, from my point of view, must not allocate.

For example, this function :

inc :: forall s. STRef s Int -> ST s () inc v = modifySTRef' v (+1)

Is responsible for 10% of the allocation in this program. I was hoping that the "unboxing magic" of GHC may have replaced my Int somewhere by an unboxed one, but I guess the allocation means that the STRef is somehow pointing to a heap allocated Int.

Do you know a way to remove theses allocations?

As a second question, if you see any other way to improve this implementation, I'll be happy to learn something new ;)

Finally, if you read my code, you'll see that I implemented a small DSL around ST / STRef / MVector to make my main function more readable. How do people write readable code in real life ?

Especially, I was annoyed when I tried to convert this piece of C code :

if (x1 < mid && (x2 == sz || tmp[x1] <= p[x2])) { a; } else { b; }

In my haskell code, x1, x2 are STRef, and tmp and p are MVector. This was highly painful to write in haskell and I had to write something such as:

x1 <- readSTRef x1' x2 <- readSTRef x2'

cond <- if x1 < mid then if x2 == sz then return True else do tmp_x1 <- Vector.read tmp x1 p_x2 <_ Vector.read p x2 return (tmp_x1 <= p_x2) else return False if cond then a else b

This is painful, complex and error prone, so is there another solution ?

Thank you. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org javascript:; http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Amos Robinson

11:08 p.m.

Hi, You might want to add inline pragmas to "inc" and so on. STRef is boxed, so it will allocate. The 'URef' type here might be useful: http://hackage.haskell.org/package/mutable-containers-0.3.2.1/docs/Data-Muta... It is just a wrapper around unboxed mutable vector, as Carter suggests. The sad story is that in my experience, if you really want decent performance you will have to dump the Core and inspect it by hand. You'd then add inline and bang patterns based on the Core. I do not recommend it as a relaxing weekend exercise. I usually use an invocation something like:

...

ghc -ddump-prep -dppr-case-as-let -dsuppress-all -fforce-recomp inversions.hs > inversions.hscore

On Sun, 19 Jun 2016 at 06:38 Guillaume Bouchard < guillaum.bouchard+haskell@gmail.com> wrote:

...

Hello.

I read a reddit post [0] which ends in a debate about haskell performances. Someone gives a C implementation which runs in 15ms on some dataset. I wanted to learn about mutable vector and ST and got something in haskell which runs in 50ms. Code is here [1].

[0] https://www.reddit.com/r/haskell/comments/4ojsn0/counting_inversions_haskell... [1] https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs

When I'm profiling the code, I'm surprised to see more than 50% of the %alloc in function which, from my point of view, must not allocate.

For example, this function :

inc :: forall s. STRef s Int -> ST s () inc v = modifySTRef' v (+1)

Is responsible for 10% of the allocation in this program. I was hoping that the "unboxing magic" of GHC may have replaced my Int somewhere by an unboxed one, but I guess the allocation means that the STRef is somehow pointing to a heap allocated Int.

Do you know a way to remove theses allocations?

As a second question, if you see any other way to improve this implementation, I'll be happy to learn something new ;)

Finally, if you read my code, you'll see that I implemented a small DSL around ST / STRef / MVector to make my main function more readable. How do people write readable code in real life ?

Especially, I was annoyed when I tried to convert this piece of C code :

if (x1 < mid && (x2 == sz || tmp[x1] <= p[x2])) { a; } else { b; }

In my haskell code, x1, x2 are STRef, and tmp and p are MVector. This was highly painful to write in haskell and I had to write something such as:

x1 <- readSTRef x1' x2 <- readSTRef x2'

cond <- if x1 < mid then if x2 == sz then return True else do tmp_x1 <- Vector.read tmp x1 p_x2 <_ Vector.read p x2 return (tmp_x1 <= p_x2) else return False if cond then a else b

This is painful, complex and error prone, so is there another solution ?

Thank you. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Guillaume Bouchard

19 Jun 19 Jun

10:37 a.m.

Thank you for this answer. On Sun, Jun 19, 2016 at 1:08 AM, Amos Robinson wrote:

...

Hi, You might want to add inline pragmas to "inc" and so on. STRef is boxed, so it will allocate. The 'URef' type here might be useful: http://hackage.haskell.org/package/mutable-containers-0.3.2.1/docs/Data-Muta... It is just a wrapper around unboxed mutable vector, as Carter suggests.

I added INLINE pragma on all my functions and it does not change anything unfortunately. I tried the package mutable-containers and the Api is really great, but I'm surprised by the results : - PRef (Unboxed bytearray of size 1, the closest thing to an unboxed stack allocation) : 86ms - URef (Unboxed vector of size 1) : 130ms - SRef (Storable vector of size 1) : 43ms (This is an improvement !) - BRef (Boxed ref) : 137ms - STRef : 54ms (this was my baseline) This is based on the creation of the Ref using a "mutable" function such as : mutable v = asF <$> newRef v (with asF == asPRef, asURef, ...) However, if I'm providing a type signature for mutable, I got totally different results : mutable :: forall (f :: * -> *) a. (Prim a, PrimMonad f) => a -> f (MyRef (PrimState f) a) mutable v = asF <$> newRef v (with type MyRef = PRef, URef, ...) - PRef: 39ms (huge improvement, and winner as expected) - URef: 45ms - SRef: 43ms - BRef: 84ms - STRef: 54ms I don't understand the speed difference which is only impacted by the availability of a signature for mutable. Now much of the allocation due to STRef have disappears. There is still some which annoys me. Most of the allocations comes from that line : https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs#L35 Which is nothing more that a boolean expression between different ST action which are reading from unboxed Vector or PRef. So I was expecting without allocation. Perhaps the fact that it uses "high level" combinators such as "<$>", "<*>" and "=<<" is responsible ?

...

The sad story is that in my experience, if you really want decent performance you will have to dump the Core and inspect it by hand. You'd then add inline and bang patterns based on the Core. I do not recommend it as a relaxing weekend exercise. I usually use an invocation something like:

...
ghc -ddump-prep -dppr-case-as-let -dsuppress-all -fforce-recomp inversions.hs > inversions.hscore

Thank you ;) For now I rather prefer reading intel x86 assembler than GHC core, but today is a rainy day, so I'll use it to start my understanding of the GHC core ;) -- G.

Bertram Felgenhauer

1:12 p.m.

Guillaume Bouchard wrote:

...

- PRef (Unboxed bytearray of size 1, the closest thing to an unboxed stack allocation) : 86ms - URef (Unboxed vector of size 1) : 130ms - SRef (Storable vector of size 1) : 43ms (This is an improvement !) - BRef (Boxed ref) : 137ms - STRef : 54ms (this was my baseline)

You really shouldn't use any mutable variable at all for this, but pass the values around as function arguments instead: count_inv' :: V2.MVector s Int32 -> V2.MVector s Int32 -> ST s Int count_inv' a buf | V.length a <= 1 = return 0 | otherwise = do let len = V.length a mid = len `div` 2 counta <- count_inv' (V.slice 0 mid a) buf countb <- count_inv' (V.slice mid (len - mid) a) buf V.unsafeCopy (V.slice 0 mid buf) (V.slice 0 mid a) let go idx1 idx2 count i = if i == len then return count else do cond <- return (idx1 < mid) .&&. (return (idx2 == len) .||. (V.unsafeRead buf idx1 .<=. V.unsafeRead a idx2)) if cond then do V.unsafeRead buf idx1 >>= V.unsafeWrite a i go (idx1 + 1) idx2 (count + idx2 - mid) (i+1) else do V.unsafeRead a idx2 >>= V.unsafeWrite a i go idx1 (idx2 + 1) count (i+1) go 0 mid (counta + countb) (0 :: Int) Besides, the code spends most of its time on parsing the input. The following more low-level code does the job far more quickly: import Data.ByteString.Char8 as B parse' :: IO [Int32] parse' = do content <- B.getContents return $ map fromIntegral $ unfoldr (B.readInt . B.dropWhile (=='\n')) $ content It's possible to improve this slightly by implementing the code from scratch: parse'' :: IO [Int32] parse'' = do content <- B.getContents return $ go content where go b = case B.uncons b of Nothing -> [] Just ('\n',b) -> go b Just ('-',b) -> go'' 0 b Just (d,b) -> go' (fromIntegral (ord d - 48)) b go' v b = case B.uncons b of Nothing -> [v] Just ('\n',b) -> v : go b Just (d,b) -> go' (v*10 + fromIntegral (ord d - 48)) b go'' v b = case B.uncons b of Nothing -> [v] Just ('\n',b) -> v : go b Just (d,b) -> go' (v*10 - fromIntegral (ord d - 48)) b Taken together these changes improve the runtime from 79ms to 21ms here. Cheers, Bertram

Bertram Felgenhauer

1:44 p.m.

Bertram Felgenhauer via Haskell-Cafe wrote:

...

Guillaume Bouchard wrote:

...
- PRef (Unboxed bytearray of size 1, the closest thing to an unboxed stack allocation) : 86ms - URef (Unboxed vector of size 1) : 130ms - SRef (Storable vector of size 1) : 43ms (This is an improvement !) - BRef (Boxed ref) : 137ms - STRef : 54ms (this was my baseline)

You really shouldn't use any mutable variable at all for this, but pass the values around as function arguments instead [...]

As to why, the reason is that to get good performance, these variables should end up in registers. But the mutable variables in Haskell are all heap-allocated objects, and afaik the compiler has no way of allocating them elsewhere (on the stack, or in a register). So the only way to get good code is to not use Haskell's mutable variables at all. All this is specific to ghc, obviously. Cheers, Bertram

Guillaume Bouchard

2:10 p.m.

Thank you ! I'm now in an equal time with the C code ;) On Sun, Jun 19, 2016 at 3:12 PM, Bertram Felgenhauer via Haskell-Cafe wrote:

...

Guillaume Bouchard wrote:

...
- PRef (Unboxed bytearray of size 1, the closest thing to an unboxed stack allocation) : 86ms - URef (Unboxed vector of size 1) : 130ms - SRef (Storable vector of size 1) : 43ms (This is an improvement !) - BRef (Boxed ref) : 137ms - STRef : 54ms (this was my baseline)

You really shouldn't use any mutable variable at all for this, but pass the values around as function arguments instead:

Nice ! This is so obvious that I did not thought about it. Thank you. (Thank you for the followup-mail, it is clear that it may be easier for GHC to unbox / put in register "normal variable" than more complex type such as the one involved in PRef).

...

Besides, the code spends most of its time on parsing the input. The following more low-level code does the job far more quickly:

Impressive. On the profiling the parsing was accounting only for 10% of the running time. Perhaps the profiler introduces more overhead on the vector part than on the parsing.

...

import Data.ByteString.Char8 as B

parse' :: IO [Int32] parse' = do content <- B.getContents return $ map fromIntegral $ unfoldr (B.readInt . B.dropWhile (=='\n')) $ content

This is in part with the C code. ~16ms

...

It's possible to improve this slightly by implementing the code from scratch:

parse'' :: IO [Int32] parse'' = do content <- B.getContents return $ go content where go b = case B.uncons b of Nothing -> [] Just ('\n',b) -> go b Just ('-',b) -> go'' 0 b Just (d,b) -> go' (fromIntegral (ord d - 48)) b go' v b = case B.uncons b of Nothing -> [v] Just ('\n',b) -> v : go b Just (d,b) -> go' (v*10 + fromIntegral (ord d - 48)) b go'' v b = case B.uncons b of Nothing -> [v] Just ('\n',b) -> v : go b Just (d,b) -> go' (v*10 - fromIntegral (ord d - 48)) b

Taken together these changes improve the runtime from 79ms to 21ms here.

This is better than the C code ;) Thank you. I learned a few thing today. -- Guillaume

Bertram Felgenhauer

3:35 p.m.

Guillaume Bouchard wrote:

...

Thank you ! I'm now in an equal time with the C code ;)

On Sun, Jun 19, 2016 at 3:12 PM, Bertram Felgenhauer via Haskell-Cafe wrote:

...
Guillaume Bouchard wrote:

...
- PRef (Unboxed bytearray of size 1, the closest thing to an unboxed stack allocation) : 86ms - URef (Unboxed vector of size 1) : 130ms - SRef (Storable vector of size 1) : 43ms (This is an improvement !) - BRef (Boxed ref) : 137ms - STRef : 54ms (this was my baseline)

You really shouldn't use any mutable variable at all for this, but pass the values around as function arguments instead:

Nice ! This is so obvious that I did not thought about it. Thank you. (Thank you for the followup-mail, it is clear that it may be easier for GHC to unbox / put in register "normal variable" than more complex type such as the one involved in PRef).

I'd say that the complexity of the type isn't the main obstacle here, but mutability. The point is that when a value is unboxed, one essentially creates a copy of it (on the stack, and possibly on the heap if one ends up re-boxing the value later on). For ordinary Haskell values that is not a problem; they are immutable and a copy is identical, for all purposes, to the original. The only potential downside to unboxing is that some memory may be wasted. With mutable data, however, copies cannot track updates to the original, so unboxing is no longer safe unconditionally. It still could be done if the mutable structure is never shared, but it requires a precise non-sharing analysis. I suspect that this kind of analysis does not (yet?) exist in ghc, because it would not benefit pure code. Cheers, Bertram

Stefan Monnier

20 Jun 20 Jun

2:31 a.m.

...

unconditionally. It still could be done if the mutable structure is never shared, but it requires a precise non-sharing analysis. I suspect that this kind of analysis does not (yet?) exist in ghc, because it would not benefit pure code.

Since these mutable boxes seem to mimick plain C variables, it's indeed better to map them to immutable Haskell variables. After all, that's exactly what the C compiler will do via SSA. GHC could try to do the same with Refs, but it would only be beneficial on "C-style Haskell code": usually when Refs are used, it's precisely because SSA doesn't work for that variable, such as when the equivalent C code would use & on it. Stefan

3310

Age (days ago)

3312

Last active (days ago)

List overview

Download

8 comments

5 participants

participants (5)

Amos Robinson
Bertram Felgenhauer
Carter Schonwald
Guillaume Bouchard
Stefan Monnier