Interesting folds over bytestring lists?

I have a data structure which is a list of bytestrings, but externally it looks like one big string. One of the operations I want to support takes a section of the string, starting at some arbitrary index and ending somewhere further down the line. In implementing the function I came up with the two functions below, dropTo and takeTo. In my mind, dropTo moves over the list of bytestrings until it reaches the starting point, and then returns the rest. takeTo, in contrast, scans over the list until it has seen enough bytes to return the amount requested. In both cases I am trying to share structure as much as possible in order to avoid unnecessary copying and space leaks. I thought these two functions were interested and am looking for feedback, comments, improvements, etc. Thanks! import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Base as B import Test.QuickCheck import Data.List (foldl') import Data.Word dropTo :: Int -> [L.ByteString] -> [L.ByteString] dropTo _ [] = [] dropTo amt strs = let dropTo' :: (Int, [L.ByteString]) -> L.ByteString -> (Int, [L.ByteString]) dropTo' (rem, acc) ss | rem == 0 = (0, acc) | otherwise = let chunks = L.toChunks ss in case foldl' dropStricts (rem, chunks) chunks of (!n, rest) | null rest -> (n, drop 1 acc) | otherwise -> (0, L.fromChunks rest : drop 1 acc) dropStricts :: (Int, [S.ByteString]) -> S.ByteString -> (Int, [S.ByteString]) dropStricts (rem, acc) str | rem == 0 = (0, acc) | rem - S.length str == 0 = (0, drop 1 acc) | rem - S.length str < 0 = (0, S.drop rem str : (drop 1 acc)) | otherwise = (rem - S.length str, drop 1 acc) (_, rest) = foldl' dropTo' (amt, strs) strs in rest takeTo :: Int -> [L.ByteString] -> [L.ByteString] takeTo _ [] = [] takeTo amt strs = let countLazies :: (Int, Int, L.ByteString) -> L.ByteString -> (Int, Int, L.ByteString) countLazies (rem, !total, lazyLeftover) ss | rem == 0 = (0, total, lazyLeftover) | otherwise = let chunks = L.toChunks ss in case foldl' countStricts (rem, 0, S.empty) chunks of (!n, amt, strictLeftover) | S.null strictLeftover -> (n, total + 1, L.empty) | otherwise -> (n, total, L.fromChunks (take amt chunks ++ [strictLeftover])) countStricts :: (Int, Int, S.ByteString) -> S.ByteString -> (Int, Int, S.ByteString) countStricts (rem, !total, leftover) str | rem == 0 = (0, total, leftover) | rem - S.length str == 0 = (0, total + 1, S.empty) | rem - S.length str < 0 = (0, total, S.take rem str) | otherwise = (rem - S.length str, total + 1, S.empty) in case foldl' countLazies (amt, 0, L.empty) strs of (_, total, leftover) | L.null leftover -> take total strs | otherwise -> take total strs ++ [leftover] prop_dropToNonEmpty :: [[Word8]] -> Int -> Property prop_dropToNonEmpty strs amt = amt >= 0 && (all (not . null) strs) ==> all (not . L.null) (dropTo amt (map (L.pack) strs)) prop_dropToCorrect :: [[Word8]] -> Int -> Property prop_dropToCorrect strs amt = let lazyStr = L.drop amt64 (toLazyBS strs) amt64 = fromIntegral amt in amt >= 0 && (all (not . null) strs) ==> (L.concat (dropTo amt (map (L.pack) strs))) == lazyStr prop_takeToNonEmpty :: [[Word8]] -> Int -> Property prop_takeToNonEmpty strs amt = amt >= 0 && (all (not . null) strs) ==> all (not . L.null) (takeTo amt (map (L.pack) strs)) prop_takeToCorrect :: [[Word8]] -> Int -> Property prop_takeToCorrect strs amt = let lazyStr = L.take amt64 (toLazyBS strs) amt64 = fromIntegral amt in amt >= 0 && (all (not . null) strs) ==> (L.concat (takeTo amt (map (L.pack) strs))) == lazyStr -- Functions and instances for testing purposes toLazyBS :: [[Word8]] -> L.ByteString toLazyBS = L.concat . map L.pack instance Arbitrary Word8 where arbitrary = elements [minBound .. maxBound] coarbitrary = undefined main = do putStrLn "prop_takeToNonEmpty" quickCheck prop_takeToNonEmpty putStrLn "prop_takeToCorrect" quickCheck prop_takeToCorrect putStrLn "prop_dropToCorrect" quickCheck prop_dropToCorrect putStrLn "prop_dropToNonEmpty" quickCheck prop_dropToNonEmpty

In message
I have a data structure which is a list of bytestrings, but externally it looks like one big string.
A lazy bytestring is a list of strict bytestring which externally looks like one big string. Could you not just use a lazy bytestring and it's take and drop functions? Perhaps you can help me understand what it is you're trying to do? Duncan

On 9/20/07, Duncan Coutts
A lazy bytestring is a list of strict bytestring which externally looks like one big string. Could you not just use a lazy bytestring and it's take and drop functions? Perhaps you can help me understand what it is you're trying to do?
I'm working on the ICFP contest from this year, and the algorithm frequently prepends long strings to the front of the "DNA" string being processed. I originally worked only with a lazy bytestring but it 'append' wasn't fast enough, so I'm trying this representation. Your email makes me think I should work directly with a list of strict bytestrings, but in that case what happens when I want to take a large chunk of strings out of the middle of the list? Would that be an O(n) operation? Justin

On 9/20/07, Justin Bailey
Your email makes me think I should work directly with a list of strict bytestrings, but in that case what happens when I want to take a large chunk of strings out of the middle of the list? Would that be an O(n) operation?
I used a list of strict bytestrings for this task and got pretty good performance (40+k iterations per second, around 40 seconds to run endo.dna) A strict bytestring looks like this: import qualified Data.ByteString.Base as B B.PS fptr offset length (PS means Packed String) fptr :: ForeignPtr Word8 fptr holds a safe pointer an immutable block of memory allocated for a bytestring. This buffer can be shared between multiple bytestrings. offset :: Int offset is the offset from the beginning of the bytestring. length :: Int length is the length of the bytestring (in bytes) This makes the take and drop functions on bytestrings extremely fast: take n s@(B.PS fptr offset length) | n <= 0 = B.empty | n >= length = s | otherwise = B.PS fptr offset n drop n s@(B.PS fptr offset length) | n <= 0 = s | n >= length = B.empty | otherwise = B.PS fptr (offset + n) (length - n) (spoiler follows) ... ... ... ... ... ... What you will find, however, is that a list of bytestrings isn't -quite- good enough; the first thing the DNA does is create a huge number of tiny copies of a single base. You can solve this by setting a threshold number of chunks in the DNA list, and "garbage collecting" the result into a single bytestring every time the number of chunks grows too high. -- ryan

In message
On 9/20/07, Duncan Coutts
wrote: A lazy bytestring is a list of strict bytestring which externally looks like one big string. Could you not just use a lazy bytestring and it's take and drop functions? Perhaps you can help me understand what it is you're trying to do?
I'm working on the ICFP contest from this year, and the algorithm frequently prepends long strings to the front of the "DNA" string being processed. I originally worked only with a lazy bytestring but it 'append' wasn't fast enough, so I'm trying this representation.
But you do realise it's exactly the same representation. Append for a lazy bytestring is O(n) in the number of chunks n, this will also be true for your 'new' representation.
Your email makes me think I should work directly with a list of strict bytestrings,
That's exactly what a lazy bytestring is. You'll not get any performance improvements without changing the data representation. A list is not good enough for what you want to do because so many operations are O(n) in the number of chunks.
but in that case what happens when I want to take a large chunk of strings out of the middle of the list? Would that be an O(n) operation?
Yes. That's exactly the problem. What you want rather than a list of strict bytestrings is a tree of strict bytestrings. You want a fingertree of strict bytestrings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/fingertree newtype ByteSequence = BS (FingerTree (Sum Int) Strict.ByteString) instance Measured (Sum Int) Strict.ByteString where measure = Sum . Strict.length You'll have to wrap the operations you need, (like split, take, drop and append) to make the ByteSequence look like a single sequence of bytes rather than a sequence of chunks. You probably want to enforce an invariant that no chunk is ever empty (as we do for lazy bytestrings). For best performance over a large number of inserts and deletes you might need to implement merging adjacent small blocks so that the block size does not degrade too much. An alternative structure if you tend to do lots of inserts and deletes at near the same spot is a zipper structure with a cursor. I'm not so sure what the best structure for that might be, perhaps just a pair of finger trees giving the parts of the sequence before and after the insertion point (since finger trees give fast access to the ends but slower O(log n) access to points n chunks from the closer end). Have fun :-) I should point out that other people who did this year's ICFP contest have also looked at structures like this (though mostly after the contest finished), so you might want to talk or collaborate with them. Duncan

On 9/21/07, Duncan Coutts
I should point out that other people who did this year's ICFP contest have also looked at structures like this (though mostly after the contest finished), so you might want to talk or collaborate with them.
I used exactly this structure written in Java and merged nodes.
participants (4)
-
Duncan Coutts
-
Johan Tibell
-
Justin Bailey
-
Ryan Ingram