
I noticed that ByteString is drastically slower than String if I use cons a lot. According to the source, that is expected, because of the memcpy for the second parameter. But it seems to me that construction should be able to play the dual trick to deconstruction (which does not copy the tail, but returns an indirection into the original string). Roughly speaking: when constructing a ByteString via cons h t, we know the length of the result (1 + length t), so instead of creating t in a separate storage location and then copying it over into the result, we could try to create it in place. Or rather, we could delay the construction of ByteStrings until we know whether we need to allocate fresh memory for them or whether they can be created by filling in some surrounding context.

This rough idea runs into a couple of issues in practice:

- First of all, it seems to be a runtime fusion (unless we do whole-program optimization, simplifier rewrite rules won't do, although unfolding recursions might still expose some opportunities for a static variant of this fusion).

- If t is shared, we'd like to redirect these shared references to an indirection into the tail of cons h t.

Ignoring the second point for now: if we look at the source for cons, we find something like

  cons :: Word8 -> ByteString -> ByteString
  cons c (PS x s l) = unsafeCreate (l+1) $ \p -> .. poke p c; memcpy (p+1) ..

Now, let's imagine a pre-ByteString as a not-yet-allocated ByteString,

  data PreBS = PrePS Int (Ptr Word8 -> IO ())

so that

  createBS :: PreBS -> ByteString
  createBS (PrePS l f) = unsafeCreate l f

  consPre :: Word8 -> ByteString -> PreBS
  consPre c (PS x s l) = PrePS (l+1) $ \p -> .. poke p c; memcpy (p+1) ..

  cons :: Word8 -> ByteString -> ByteString
  cons c bs = createBS (consPre c bs)

Then we could express our fusion as

  consPre c (createBS (PrePS l f)) = PrePS (l+1) $ \p -> .. poke p c; f (p+1) ..

In other words, in a typical map-like recursion scheme, we do not create, copy and release the tails recursively, but delay creation until we know where to embed our ByteString, at which point we do a sequence of pokes and no memcpy. But note that we are matching on a function application here, and we are not in a simplifier rule, so this doesn't work as it stands. That is as far as I have got (see the attached example for a manual use of pre-ByteStrings to speed up a map).

Now my questions for you :-)

- Does this make sense?

- Can it be made to work? As we probably cannot redirect shared references to t in (cons h t), can we identify the situations where t is not shared, as in a map? Or can we just ignore any shared references (they will point to a "createBS (PrePS ..)" combination, and should keep working, since we bypass those combinations instead of rewriting them)?

A tentative idea would be to overload create so that it produces a proper, allocated ByteString where one is expected, but can also just pass the PreBS through where the context can handle it:

  class Create r where
    create :: PreBS -> r

  instance Create ByteString where
    create (PrePS l f) = unsafeCreate l f

  instance Create PreBS where
    create = id

  consPre :: Word8 -> ByteString -> PreBS
  consPre c (PS x s l) = PrePS (l+1) $ \p -> .. poke p c; memcpy (p+1) ..

  cons :: Word8 -> ByteString -> ByteString
  cons c bs = create (consPre c bs)

  consPre' :: Word8 -> PreBS -> PreBS
  consPre' c (PrePS l f) = PrePS (l+1) $ \p -> .. poke p c; f (p+1) ..

so that "consPre c (create (PrePS ..))" works as before, while "consPre' c (create (PrePS ..))" takes the fusion path. Or something like that ;-)

Claus
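
P.S. For readers who can't see the attachment: below is a minimal, self-contained sketch of the idea, not the attached code. The helpers toPre, emptyPre, consCopy and mapPre are just names I made up for this sketch, and it leans on unsafeCreate from Data.ByteString.Internal, unsafeUseAsCStringLen from Data.ByteString.Unsafe and copyBytes from Foreign.Marshal.Utils; the point is only to show where the per-cons memcpy disappears, not to propose a library API.

  module PreBS where

  import qualified Data.ByteString as B
  import qualified Data.ByteString.Internal as BI
  import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
  import Data.Word (Word8)
  import Foreign.Marshal.Utils (copyBytes)
  import Foreign.Ptr (Ptr, castPtr, plusPtr)
  import Foreign.Storable (poke)

  -- A not-yet-allocated ByteString: its final length, plus an action
  -- that fills a caller-supplied buffer of that length.
  data PreBS = PrePS !Int (Ptr Word8 -> IO ())

  -- Force a PreBS: allocate once, then run the fill action.
  createBS :: PreBS -> B.ByteString
  createBS (PrePS l f) = BI.unsafeCreate l f

  -- Embed an existing ByteString; forcing it later costs one copy
  -- into whatever buffer the context supplies.
  toPre :: B.ByteString -> PreBS
  toPre bs = PrePS (B.length bs) $ \p ->
    unsafeUseAsCStringLen bs $ \(src, len) ->
      copyBytes p (castPtr src) len

  -- The empty delayed string, used as the base of a fold.
  emptyPre :: PreBS
  emptyPre = PrePS 0 (\_ -> return ())

  -- Fused cons: poke the new byte, then let the delayed tail fill the
  -- rest of the same buffer. No intermediate allocation, no tail copy.
  consPre :: Word8 -> PreBS -> PreBS
  consPre c (PrePS l f) = PrePS (l + 1) $ \p -> poke p c >> f (p `plusPtr` 1)

  -- Unfused cons for comparison: one allocation and one copy per element,
  -- roughly the cost profile of ByteString's cons today.
  consCopy :: Word8 -> B.ByteString -> B.ByteString
  consCopy c bs = createBS (consPre c (toPre bs))

  -- A map written against PreBS: every output byte is poked straight into
  -- its final position; only the single createBS at the end allocates.
  mapPre :: (Word8 -> Word8) -> B.ByteString -> B.ByteString
  mapPre f bs = createBS (B.foldr (consPre . f) emptyPre bs)

For example, mapPre (+ 1) (B.pack [1,2,3]) == B.pack [2,3,4], but the intermediate tails never exist as separate ByteStrings, so there is no chain of copies; the cost is that the fold builds a chain of fill closures instead, which is the runtime analogue of the fusion described above.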