runtime fusion for Data.ByteString.cons ?

I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter. but it seems to me that construction should be able to play the dual trick to deconstruction (which does not copy the tail, but returns an indirection into the original list). roughly speaking: when constructing a ByteString via cons h t, we know the length of the result (1+length t), and instead of creating t in a separate storage location then copying it over to the result, we could try to create it inplace. or rather, we could delay construction of ByteStrings until we know whether we need to allocate fresh memory for them or whether they can be created by filling some context. this rough idea runs into a couple of issues in practice: - first of all, it seems to be a runtime fusion (unless we do whole program optimization, simplifier rewrites won't do, although unfolding recursions might still expose some opportunities for a static variant of this fusion) - if t is shared, we'd like to redirect these shared references to an indirection into the tail of cons h t ignoring the second point for now, if we look into the source for cons, we find something like: cons :: Word8 -> ByteString -> ByteString cons c (PS x s l) = unsafeCreate (l+1) $ \p-> ... poke p c; memcpy (p+1).. now, let's imagine a pre-ByteString as a not-yet allocated ByteString: data PreBS = PrePS l f so that createBS :: PreBS -> ByteString createBS (PrePS l f) = unsafeCreate l f consPre :: Word8 -> ByteString -> PrePS consPre c (PS x s l) = PrePS (l+1) $ \p-> ... poke p c; memcpy (p+1).. cons :: Word8 -> ByteString -> ByteString cons c bs = createBS (consPre c bs) then we could express our fusion as consPre c (create (PrePS l f)) = PrePS (l+1) $ \p-> ... poke p c; f (p+1).. in other words, in a typical map-like recursion scheme, we do not create, copy & release the tails recursively, but delay creation until we know where to embed our ByteString, at which point we do a sequence of pokes, no memcpy. but note that we are matching on a function application here, even though we are not in a simplifier rule, so this doesn't work as it stands.. this is as far as I got so far.. (see attached example for a manual use of pre-ByteStrings to speed up a map). now my questions for you:-) - does this make sense? - can it be made to work? as we probably cannot redirect shared references to t in (cons h t), can we identify the situations where t is not shared, as in a map, or can we just ignore any shared references (they will point to a "create (PrePS ..)" combination, and should just keep working, since we bypass those combinations instead of rewriting them)? a tentative idea would be to overload create so that it produces a proper, allocated ByteString where such is expected, but can also just pass through the PreBS where the context can handle it? class Create r where create :: PreBS -> r instance Create ByteString where create (PrePS l f) = unsafeCreate l f instance Create PreBS where create = id consPre :: Word8 -> ByteString -> PrePS consPre c (PS x s l) = PrePS (l+1) $ \p-> ... poke p c; memcpy (p+1).. cons :: Word8 -> ByteString -> ByteString cons c bs = create (consPre c bs) consPre' :: Word8 -> PrePS -> PrePS consPre' c (PrePS l f) = PrePS (l+1) $ \p-> ... poke p c; f (p+1).. so that "consPre c (create (PrePS ..))" works as normal, while "consPre' c (create (PrePS ..))" uses the fusion path. or something like that;-) Claus

claus.reinke:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Just a quick response, before I consider this in detail, in the stream fusion branch of Data.ByteString cons is fusible: cons :: Word8 -> ByteString -> ByteString cons c (PS x s l) = unsafeCreate (l+1) $ \p -> withForeignPtr x $ \f -> do poke p c memcpy (p `plusPtr` 1) (f `plusPtr` s) (fromIntegral l) {-# INLINE [1] cons #-} {-# RULES "FPS cons -> fused" [~1] forall w. cons w = F.strTransformerUp (F.consS w) "FPS cons -> unfused" [1] forall w. F.strTransformerUp (F.consS w) = cons w #-} strTransformerUp :: (Stream -> Stream) -> (ByteString -> ByteString) strTransformerUp f = writeStrUp . f . readStrUp {-# INLINE [0] strTransformerUp #-} consS :: Word8 -> Stream -> Stream consS w (Stream nextx xs0 len) = Stream next' (True :*: xs0) (len+1) where next' (True :*: xs) = Yield w (False :*: xs) next' (_ :*: xs) = case nextx xs of Done -> Done Skip xs' -> Skip (False :*: xs') Yield x xs' -> Yield x (False :*: xs') {-# INLINE [0] consS #-} Also, have you looked at Data.ByteString.Lazy which does have O(1) cons? I'll think about the rest of your proposal after getting some coffee :) -- Don

dons:
claus.reinke:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Just a quick response, before I consider this in detail, in the stream fusion branch of Data.ByteString cons is fusible:
cons :: Word8 -> ByteString -> ByteString cons c (PS x s l) = unsafeCreate (l+1) $ \p -> withForeignPtr x $ \f -> do poke p c memcpy (p `plusPtr` 1) (f `plusPtr` s) (fromIntegral l) {-# INLINE [1] cons #-}
{-# RULES "FPS cons -> fused" [~1] forall w. cons w = F.strTransformerUp (F.consS w) "FPS cons -> unfused" [1] forall w. F.strTransformerUp (F.consS w) = cons w #-}
strTransformerUp :: (Stream -> Stream) -> (ByteString -> ByteString) strTransformerUp f = writeStrUp . f . readStrUp {-# INLINE [0] strTransformerUp #-}
consS :: Word8 -> Stream -> Stream consS w (Stream nextx xs0 len) = Stream next' (True :*: xs0) (len+1) where next' (True :*: xs) = Yield w (False :*: xs) next' (_ :*: xs) = case nextx xs of Done -> Done Skip xs' -> Skip (False :*: xs') Yield x xs' -> Yield x (False :*: xs') {-# INLINE [0] consS #-}
Oh, this is slower than it should be, too. Those Bools get in the way of GHC's specConstr optimisation. Instead it shoudl use a strict Either. consS :: Word8 -> Stream -> Stream consS w (Stream nextx xs0 len) = Stream next' (RightS xs0) (len+1) where next' (RightS xs) = Yield w (LeftS xs) next' (LeftS xs) = case nextx xs of Done -> Done Skip xs' -> Skip (LeftS xs') Yield x xs' -> Yield x (LeftS xs') {-# INLINE [0] consS #-} where data EitherS a b = LeftS !a | RightS !b deriving (Eq, Ord ) that should help a bit with the stripping away of constructors in consS. -- Don

[just saw your reply while sending this, so perhaps there's nothing new here? but then why the runtime difference? anyway, here goes nothing :-]
a tentative idea would be to overload create so that it produces a proper, allocated ByteString where such is expected, but can also just pass through the PreBS where the context can handle it?
attached is a variant that seems to do the trick, although I do confess myself slightly surprised that it does (I had thought that simplifier rules would be too late to fix overloaded types..). the core code for mapBS and mapBS' is now the same, but the versions of empty and cons used in the latter can produce either variant of ByteString, unallocated pre-ByteStrings or allocated ByteStrings (the instances of class IsByteString), and cons can also handle both variants as second parameter. so all we do in the simplifier rule is to request that mapBS' should internally use unallocated pre-ByteStrings, converting to allocated ByteString only at the end (this request then forces empty and cons to produce pre-ByteStrings as well, switching the whole recursion over to that representation). as I said, I'm somewhat surprised that this works, but commenting out the rule does have the expected impact on performance.. what do you think? Claus

On Nov 19, 2006, at 11:54 AM, Claus Reinke wrote:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Have you considered constructing your strings with unfoldr? It should be able to handle most (all?) of your string producing functions efficiently. Cheers, Spencer Janssen

On Nov 19, 2006, at 11:54 AM, Claus Reinke wrote:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Have you considered constructing your strings with unfoldr? It should be able to handle most (all?) of your string producing functions efficiently.
old habits die hard - I still underappreciate unfold;-) perhaps I should expand my habits to include unfold more often, but in this case, I was interested in the performance of naively recursive ByteString programming. and the cons performance was the very first thing I noticed, so I tried to do something about it. I guess I got con(s)fused by the two branches of Data.ByteString: since it is part of base since ghc 6.6, I thought that pulling from the ghc/libraries darcs repository would give me the latest and greatest Data.ByteString, as described in the string rewriting paper. the lack of Yields et al should have tripped me up.. what is the plan for that branch? and if there are issues that prevent an update, shouldn't they be mentioned on the Data.ByteString page? claus

claus.reinke:
On Nov 19, 2006, at 11:54 AM, Claus Reinke wrote:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Have you considered constructing your strings with unfoldr? It should be able to handle most (all?) of your string producing functions efficiently.
old habits die hard - I still underappreciate unfold;-)
perhaps I should expand my habits to include unfold more often, but in this case, I was interested in the performance of naively recursive ByteString programming. and the cons performance was the very first thing I noticed, so I tried to do something about it.
I guess I got con(s)fused by the two branches of Data.ByteString: since it is part of base since ghc 6.6, I thought that pulling from the ghc/libraries darcs repository would give me the latest and greatest Data.ByteString, as described in the string rewriting paper. the lack of Yields et al should have tripped me up..
what is the plan for that branch? and if there are issues that prevent an update, shouldn't they be mentioned on the Data.ByteString page?
Sometime before January I expect to tag and release fps 0.9 (with stream fusion). It will then be merged into the base library, and be available with the next GHC release. The api should be identical to the currrent fps 0.8 in the base library, just faster, since its using stream fusion instead of old-style functional array fusion. -- Don

claus.reinke:
On Nov 19, 2006, at 11:54 AM, Claus Reinke wrote:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
Have you considered constructing your strings with unfoldr? It should be able to handle most (all?) of your string producing functions efficiently.
old habits die hard - I still underappreciate unfold;-)
perhaps I should expand my habits to include unfold more often, but in this case, I was interested in the performance of naively recursive ByteString programming. and the cons performance was the very first thing I noticed, so I tried to do something about it.
I guess I got con(s)fused by the two branches of Data.ByteString: since it is part of base since ghc 6.6, I thought that pulling from the ghc/libraries darcs repository would give me the latest and greatest Data.ByteString, as described in the string rewriting paper. the lack of Yields et al should have tripped me up..
what is the plan for that branch? and if there are issues that prevent an update, shouldn't they be mentioned on the Data.ByteString page?
claus
Of course, you can also use stream-based ByteStrings right now, you just have to remove the Data.ByteString* dirs from base before you build, and then install the unstable fps branch via Cabal. -- Don

Hello Donald, Monday, November 20, 2006, 4:01:05 AM, you wrote:
Of course, you can also use stream-based ByteStrings right now, you just have to remove the Data.ByteString* dirs from base before you build, and then install the unstable fps branch via Cabal.
great! i think that inclusion of fps into base was mistake, and now you should understand why -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Mon, 2006-11-20 at 05:08 +0300, Bulat Ziganshin wrote:
Hello Donald,
Monday, November 20, 2006, 4:01:05 AM, you wrote:
Of course, you can also use stream-based ByteStrings right now, you just have to remove the Data.ByteString* dirs from base before you build, and then install the unstable fps branch via Cabal.
great! i think that inclusion of fps into base was mistake, and now you should understand why
On the other hand it means that we can make other things in base take advantage of Data.ByteString (or Data.PackedString when that becomes available) Duncan

Of course, you can also use stream-based ByteStrings right now, you just have to remove the Data.ByteString* dirs from base before you build, and then install the unstable fps branch via Cabal. great! i think that inclusion of fps into base was mistake, and now you should understand why On the other hand it means that we can make other things in base take advantage of Data.ByteString (or Data.PackedString when that becomes available)
but the current situation sits uneasily between two workable approaches: 1 Data.ByteString evolves independent of main libs: - it should be in its own package, easily updateable - base could still depend on that package 2 Data.ByteString is part of base - the main darcs repository should have the latest version - just as the rest of the repository, development versions between releases might be unstable claus

Hello Claus, Monday, November 20, 2006, 6:09:43 PM, you wrote:
1 Data.ByteString evolves independent of main libs: - it should be in its own package, easily updateable - base could still depend on that package
and FPS will not depend on base? :D -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Claus,
Monday, November 20, 2006, 6:09:43 PM, you wrote:
1 Data.ByteString evolves independent of main libs: - it should be in its own package, easily updateable - base could still depend on that package
and FPS will not depend on base? :D
I think the Data.ByteString issues will be a noop soon. Its unlikely to change much after December, when 0.9 (streams for bytestrings) is tagged and set free. -- Don

Hello Donald, Tuesday, November 21, 2006, 2:29:47 AM, you wrote:
I think the Data.ByteString issues will be a noop soon. Its unlikely to change much after December, when 0.9 (streams for bytestrings) is tagged and set free.
and, indeed, then you will go back in time and include it in ghc 6.6? :) may be you trying to say that fps problem is noop because noone will use versions of your lib newer than 0.8? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Duncan, Monday, November 20, 2006, 4:24:31 PM, you wrote:
great! i think that inclusion of fps into base was mistake, and now you should understand why
On the other hand it means that we can make other things in base take advantage of Data.ByteString (or Data.PackedString when that becomes available)
1) it will be great to not include FPS into base until you will start to implement such things. anyway, you can't change base between major ghc releases and its the whole problem 2) i'm hardly imagine things that really need to be included in base and use FPS. the whole problem, again, is difficulties with changes in base which prevents its whole development. its much, much better to implement new things in other libs which can be easily installed, upgraded and don't suffer from ghc-version dependency. we should think about removing things from base, not in the opposite direction -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sun, 2006-11-19 at 17:54 +0000, Claus Reinke wrote:
I noticed that ByteString is drastically slower than String if I use cons a lot. according to the source, that is expected because of the memcpy for the second parameter.
but it seems to me that construction should be able to play the dual trick to deconstruction (which does not copy the tail, but returns an indirection into the original list).
Another approach which I have considered is to do it directly by just poking into an array but then do cunning things to make it persistent at yet still O(1) in the best case of single-threaded construction. Here the representation: data StringBuilder = StringBuilder (ForeignPtr Word8) Int Int (IORef Int) -- pointer, offset, length and 'current length' So just like the ByteString representation but with an extra IORef Int. The idea is that the IORef tells us the current length of the used part of the memory block. So by comparing the length at the time this StringBuilder value was made with the real current length then we can see if we're using the 'latest' version of the StringBuilder or if it's been appended/prepended to since. If we're using the latest value then we can reserve some space by atomically incrementing the IORef and then directly write into the free space. If we're not starting from the latest value then we incur a O(n) penalty to copy the array. Of course in a sequence of cons/snoc operations to an old value the copying only happens once since now we have a new unshared array. To make this scheme efficient the locking has to be cheap or preferably someone could figure out a lockless version. This could usefully be combined with lazy bytestrings (implemented either as lists or unbalanced trees) to provide time and space efficient O(1) cons and snoc. Duncan
participants (5)
-
Bulat Ziganshin
-
Claus Reinke
-
dons@cse.unsw.edu.au
-
Duncan Coutts
-
Spencer Janssen