
Hey all, All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized. In particular, if I've serialized a large [Int] to a file, and I want to get the first element, it seems I have no choice but to deserialize the entire data structure. This is obviously an issue for large data sets. There are obvious workarounds (explicitly fetch elements from the "database" instead of relying on unsafeInterleaveIO to deal with it all magically), but it seems like it should be possible to build a cereal-like library that allows proper lazy deserialization. Does it exist, and I've just missed it? Thanks, -- Scott Lawrence Linux baidar 3.7.10-1-ARCH #1 SMP PREEMPT Thu Feb 28 09:50:17 CET 2013 x86_64 GNU/Linux

On 3/13/2013 12:15 AM, Scott Lawrence wrote:
Hey all,
All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized. In particular, if I've serialized a large [Int] to a file, and I want to get the first element, it seems I have no choice but to deserialize the entire data structure. This is obviously an issue for large data sets.
There are obvious workarounds (explicitly fetch elements from the "database" instead of relying on unsafeInterleaveIO to deal with it all magically), but it seems like it should be possible to build a cereal-like library that allows proper lazy deserialization. Does it exist, and I've just missed it?
Thanks,
I haven't tested this, but I suspect something like this could give you lazy binary serialization and deserialization. It's not tail recursive, though. newtype LazyBinaryList a = LazyBinaryList [a] instance Binary a => LazyBinaryList a where put (LazyBinaryList []) = putWord8 0 put (LazyBinaryList (x:xs)) = putWord8 1 >> put x >> put (LazyBinaryList xs) get = do t <- getWord8 case t of 0 -> return (LazyBinaryList []) 1 -> do x <- get (LazyBinaryList xs) <- get return $ LazyBinaryList (x:xs)

I tried it, but it still goes and reads the whole list. Looking at the `binary` package source code it seems that strict evaluation is hard-coded in a few places, presumably for performance reasons. It also seems to necessarily read the bytestring sequentially, so complex tree-like data structures would presumably encounter problems even if it worked for a list. Ah well. As long as I'm not duplicating someone else's work, I'm more than happy to go at this from scratch. On Wed, 13 Mar 2013, Jeff Shaw wrote:
On 3/13/2013 12:15 AM, Scott Lawrence wrote:
Hey all,
All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized. In particular, if I've serialized a large [Int] to a file, and I want to get the first element, it seems I have no choice but to deserialize the entire data structure. This is obviously an issue for large data sets.
There are obvious workarounds (explicitly fetch elements from the "database" instead of relying on unsafeInterleaveIO to deal with it all magically), but it seems like it should be possible to build a cereal-like library that allows proper lazy deserialization. Does it exist, and I've just missed it?
Thanks,
I haven't tested this, but I suspect something like this could give you lazy binary serialization and deserialization. It's not tail recursive, though.
newtype LazyBinaryList a = LazyBinaryList [a]
instance Binary a => LazyBinaryList a where put (LazyBinaryList []) = putWord8 0 put (LazyBinaryList (x:xs)) = putWord8 1 >> put x >> put (LazyBinaryList xs) get = do t <- getWord8 case t of 0 -> return (LazyBinaryList []) 1 -> do x <- get (LazyBinaryList xs) <- get return $ LazyBinaryList (x:xs)
-- Scott Lawrence

that's too bad, i used lazy deserialization for an external sort thing i did aeons ago. http://www.haskell.org/pipermail/haskell-cafe/2007-July/029156.html that was an amusing exercise in lazy IO. these days it's probably better off doing something with pipes et al instead of unsafeInterleaveIO. b On Mar 13, 2013, at 2:54 PM, Scott Lawrence wrote:
I tried it, but it still goes and reads the whole list. Looking at the `binary` package source code it seems that strict evaluation is hard-coded in a few places, presumably for performance reasons. It also seems to necessarily read the bytestring sequentially, so complex tree-like data structures would presumably encounter problems even if it worked for a list.
Ah well. As long as I'm not duplicating someone else's work, I'm more than happy to go at this from scratch.
On Wed, 13 Mar 2013, Jeff Shaw wrote:
On 3/13/2013 12:15 AM, Scott Lawrence wrote:
Hey all, All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized. In particular, if I've serialized a large [Int] to a file, and I want to get the first element, it seems I have no choice but to deserialize the entire data structure. This is obviously an issue for large data sets. There are obvious workarounds (explicitly fetch elements from the "database" instead of relying on unsafeInterleaveIO to deal with it all magically), but it seems like it should be possible to build a cereal-like library that allows proper lazy deserialization. Does it exist, and I've just missed it? Thanks, I haven't tested this, but I suspect something like this could give you lazy binary serialization and deserialization. It's not tail recursive, though.
newtype LazyBinaryList a = LazyBinaryList [a]
instance Binary a => LazyBinaryList a where put (LazyBinaryList []) = putWord8 0 put (LazyBinaryList (x:xs)) = putWord8 1 >> put x >> put (LazyBinaryList xs) get = do t <- getWord8 case t of 0 -> return (LazyBinaryList []) 1 -> do x <- get (LazyBinaryList xs) <- get return $ LazyBinaryList (x:xs)
-- Scott Lawrence
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 3/13/2013 6:14 PM, Ben wrote:
that's too bad, i used lazy deserialization for an external sort thing i did aeons ago.
http://www.haskell.org/pipermail/haskell-cafe/2007-July/029156.html
that was an amusing exercise in lazy IO. these days it's probably better off doing something with pipes et al instead of unsafeInterleaveIO.
b
On Mar 13, 2013, at 2:54 PM, Scott Lawrence wrote:
I tried it, but it still goes and reads the whole list. Looking at the `binary` package source code it seems that strict evaluation is hard-coded in a few places, presumably for performance reasons. It also seems to necessarily read the bytestring sequentially, so complex tree-like data structures would presumably encounter problems even if it worked for a list.
Ah well. As long as I'm not duplicating someone else's work, I'm more than happy to go at this from scratch.
On Wed, 13 Mar 2013, Jeff Shaw wrote:
On 3/13/2013 12:15 AM, Scott Lawrence wrote:
Hey all, All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized. In particular, if I've serialized a large [Int] to a file, and I want to get the first element, it seems I have no choice but to deserialize the entire data structure. This is obviously an issue for large data sets. There are obvious workarounds (explicitly fetch elements from the "database" instead of relying on unsafeInterleaveIO to deal with it all magically), but it seems like it should be possible to build a cereal-like library that allows proper lazy deserialization. Does it exist, and I've just missed it? Thanks, I haven't tested this, but I suspect something like this could give you lazy binary serialization and deserialization. It's not tail recursive, though.
newtype LazyBinaryList a = LazyBinaryList [a]
instance Binary a => LazyBinaryList a where put (LazyBinaryList []) = putWord8 0 put (LazyBinaryList (x:xs)) = putWord8 1 >> put x >> put (LazyBinaryList xs) get = do t <- getWord8 case t of 0 -> return (LazyBinaryList []) 1 -> do x <- get (LazyBinaryList xs) <- get return $ LazyBinaryList (x:xs)
-- Scott Lawrence
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe I tried really hard to make Binary read a list in lazily, but I failed. Here's a full program that demonstrates the idea I had earlier. I used binary-0.7.0.1*.*
import Data.Binary import Data.Binary.Get import System.Environment import qualified Data.ByteString.Lazy as L newtype LazyBinaryList a = LazyBinaryList [a] deriving (Show) instance Binary a => Binary (LazyBinaryList a) where put = putLazy get = getLazy putLazy :: Binary a => LazyBinaryList a -> Put putLazy (LazyBinaryList []) = putWord8 0 putLazy (LazyBinaryList (x:xs)) = putWord8 1 >> put x >> putLazy (LazyBinaryList xs) getLazy :: Binary a => Get (LazyBinaryList a) getLazy = aux [] where aux accum = do t <- getWord8 case t of 0 -> return (LazyBinaryList (reverse accum)) 1 -> do x <- get aux (x:accum) list :: LazyBinaryList Int list = LazyBinaryList [1..10000] main = do args <- getArgs case args !! 0 of "encode" -> encodeFile (args !! 1) list "decode" -> do bs <- L.readFile $ args !! 1 let lOrR = decodeOrFail bs :: Either (L.ByteString, ByteOffset, String) (L.ByteString, ByteOffset, LazyBinaryList Int) case lOrR of Left l -> print l Right (_,offset,LazyBinaryList l) -> putStrLn $ "at offset " ++ show offset ++ " read value " ++ show (head l)

Scott Lawrence
All the object serialization/deserialization libraries I could find (pretty much just binary and cereal) seem to be strict with respect to the actual data being serialized.
Binary became strict between 0.4.4 and 0.5, I think doing so improved the speed of GHC. I had a file format wrapper at the time which depended on it being lazy, and was stuck at 0.4.4 for a long time, until somebody helped me with a workaround. Performance is a bit worse than it was with 0.4.4, but I think it's better to stay with current code. The code in question is in the darcs repository at: http://malde.org/~ketil/biohaskell/biosff the relevant patch is named "Update to use [...]" -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (4)
-
Ben
-
Jeff Shaw
-
Ketil Malde
-
Scott Lawrence