Re: [Haskell-cafe] Interesting folds over bytestring lists?

21 Sep 2007

      In message  "Justin
Bailey"  writes:
...
On 9/20/07, Duncan Coutts  wrote:
...
A lazy bytestring is a list of strict bytestring which externally looks like one
big string. Could you not just use a lazy bytestring and it's take and drop
functions? Perhaps you can help me understand what it is you're trying to do?
I'm working on the ICFP contest from this year, and the algorithm
frequently prepends long strings to the front of the "DNA" string
being processed. I originally worked only with a lazy bytestring but
it 'append' wasn't fast enough, so I'm trying this representation.
But you do realise it's exactly the same representation. Append for a lazy
bytestring is O(n) in the number of chunks n, this will also be true for your
'new' representation.
...
Your email makes me think I should work directly with a list of strict
bytestrings,
That's exactly what a lazy bytestring is. You'll not get any performance
improvements without changing the data representation. A list is not good enough
for what you want to do because so many operations are O(n) in the number of chunks.
...
but in that case what happens when I want to take a large
chunk of strings out of the middle of the list? Would that be an O(n)
operation?
Yes. That's exactly the problem.

What you want rather than a list of strict bytestrings is a tree of strict
bytestrings. You want a fingertree of strict bytestrings:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/fingertree

newtype ByteSequence = BS (FingerTree (Sum Int) Strict.ByteString)

instance Measured (Sum Int) Strict.ByteString where
  measure = Sum . Strict.length

You'll have to wrap the operations you need, (like split, take, drop and append)
to make the ByteSequence look like a single sequence of bytes rather than a
sequence of chunks. You probably want to enforce an invariant that no chunk is
ever empty (as we do for lazy bytestrings). For best performance over a large
number of inserts and deletes you might need to implement merging adjacent small
blocks so that the block size does not degrade too much.

An alternative structure if you tend to do lots of inserts and deletes at near
the same spot is a zipper structure with a cursor. I'm not so sure what the best
structure for that might be, perhaps just a pair of finger trees giving the
parts of the sequence before and after the insertion point (since finger trees
give fast access to the ends but slower O(log n) access to points n chunks from
the closer end).

Have fun :-)

I should point out that other people who did this year's ICFP contest have also
looked at structures like this (though mostly after the contest finished), so
you might want to talk or collaborate with them.

Duncan