two problems with Data.Binary and Data.ByteString

I have a program that read in and populated a large data structure and then saved it out with Data.Binary and Data.ByteString.Lazy.Char8:

    saveState db = B.writeFile stateFile =<< encode <$> atomically (readTVar db)

when I go to read this in later I get a stack overflow:

    loadState db = do
        d <- decode <$> B.readFile stateFile
        atomically $ writeTVar db d

    Stack space overflow: current size 8388608 bytes.
    Use `+RTS -Ksize' to increase it.

or from ghci:

    d <- liftM decode (Data.ByteString.Lazy.Char8.readFile "savedState.bin") :: IO InstrsDb
    fromList *** Exception: stack overflow

The data type I'm storing is a Map (of maps):

    type DailyDb  = M.Map Date Daily
    type InstrsDb = M.Map String DailyDb

What's going on here? Why is the system capable of building and saving the data but not of reading and unmarshalling it? What is the proper way to track down where the exception is happening? Any debugging tips?

I also noticed another issue while testing. If my program loads the data at startup by calling loadState then all later calls to saveState give an error:

    Log: savedState.bin: openFile: resource busy (file is locked)

this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?

Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
I have a program that read in and populated a large data structure and then saved it out with Data.Binary and Data.ByteString.Lazy.Char8:
saveState db = B.writeFile stateFile =<< encode <$> atomically (readTVar db)
when I go to read this in later I get a stack overflow:
loadState db = do
    d <- decode <$> B.readFile stateFile
    atomically $ writeTVar db d
Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it.
or from ghci:
d <- liftM decode (Data.ByteString.Lazy.Char8.readFile "savedState.bin") :: IO InstrsDb
fromList *** Exception: stack overflow
The data type I'm storing is a Map (of maps):
type DailyDb  = M.Map Date Daily
type InstrsDb = M.Map String DailyDb
What's going on here? Why is the system capable of building and saving the data but not of reading and unmarshalling it? What is the proper way to track down where the exception is happening? Any debugging tips?
So a big Map is serialised as a huge list.

    instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
        put = put . Map.toAscList
        get = liftM Map.fromDistinctAscList get

so that fromAscList's the result of parsing the map as a list, via,

    instance Binary a => Binary [a] where
        put l = put (length l) >> mapM_ put l
        get   = do n <- get :: Get Int
                   replicateM n get

so that's a length-prefixed list, strictly. Which is possibly where the stack's being consumed. Does just bumping the stack size a bit help?

Alternatively, you could consider serialising the Map in some other format (i.e. newtype the list, and serialise that, say, in a lazy/chunked encoding).
I also noticed another issue while testing. If my program loads the data at startup by calling loadState then all later calls to saveState give an error:
Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?
Lazy IO. So force the result to be evaluated, and then close the handle, or use strict bytestring reading. -- Don

so that fromAscList's the result of parsing the map as a list, via,
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get
so that's a length-prefixed list, strictly. Which is possibly where the stack's being consumed. Does just bumping the stack size a bit help?
ugh.. length prefix.. I could bump the stack size to fix my immediate situation, but my goal is to have a server with a huge in-memory data set, and my test data so far is quite small.
Alternatively, you could consider serialising the Map in some other format (i.e. newtype the list, and serialise that say, in a lazy/chunked encoding).
hackery :( but that sounds like the obvious fix.
Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?
Lazy IO. So force the result to be evaluated, and then close the handle, or use strict bytestring reading.
There is no visible handle. It's all hidden in readFile. I will try forcing the data.
-- Don
Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
so that fromAscList's the result of parsing the map as a list, via,
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get
so that's a length-prefixed list, strictly. Which is possibly where the stack's being consumed. Does just bumping the stack size a bit help?
ugh.. length prefix.. I could bump the stack size to fix my immediate situation, but my goal is to have a server with a huge in-memory data set, and my test data so far is quite small.
Alternatively, you could consider serialising the Map in some other format (i.e. newtype the list, and serialise that say, in a lazy/chunked encoding).
hackery :( but that sounds like the obvious fix.
Not hackery, just a different encoding. The default Binary encodings don't cover all use cases and all scales. To hit other sweet spots, use your own instances.
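One possible shape for such an instance -- a sketch only, with a made-up StreamList name, nothing shipped in binary: tag each element with a byte, so no up-front length is needed and the writer can stream:

    import Data.Binary (Binary(..))
    import Data.Binary.Get (getWord8)
    import Data.Binary.Put (putWord8)

    newtype StreamList a = StreamList [a]

    instance Binary a => Binary (StreamList a) where
        -- a 1 byte before each element, a 0 byte to end the list
        put (StreamList xs) = mapM_ (\x -> putWord8 1 >> put x) xs >> putWord8 0
        get = do
            tag <- getWord8
            case tag of
                0 -> return (StreamList [])
                _ -> do x <- get
                        StreamList xs <- get
                        return (StreamList (x : xs))

That's essentially the cons/nil tagging used for Maybe, at a cost of one byte per element.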
Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?
Lazy IO. So force the result to be evaluated, and then close the handle, or use strict bytestring reading.
There is no visible handle. It's all hidden in readFile. I will try forcing the data.
So you can decode using openFile, hGet and hClose on strict bytestrings, or force the data. -- Don

On Tue, Aug 12, 2008 at 9:32 PM, Don Stewart
Not hackery, just a different encoding. The default Binary encodings don't cover all use cases and all scales. To hit other sweet spots, use your own instances.
Doesn't Data.Map.size run in O(1) time? Maybe something like using different encodings for big maps in the default implementation would help? -- Felipe.

Doesn't Data.Map.size run in O(1) time? Maybe something like using different encodings for big maps in the default implementation would help?
ugh, of course. Ok, so I fixed it to:

    loadState db = do
        d <- decode <$> B.readFile stateFile
        let force = sum $ map (sum . map fromEnum . M.keys) $ M.elems d
        print force
        force `seq` atomically $ writeTVar db d

(my keys are dates, which are Enum). This should look at every key in every inner map. Shouldn't that be sufficient to force the entire data set (or do I have to touch the fields in the data elements too?). I still get the same error condition.
Felipe.
Tim Newsham http://www.thenewsh.com/~newsham/

On Tue, Aug 12, 2008 at 6:01 PM, Tim Newsham
(my keys are dates, which are Enum). This should look at every key in every inner map. Shouldn't that be sufficient to force the entire data set (or do I have to touch the fields in the data elements too?)
You might have to force the last value of the alist that the map gets flattened into, since otherwise there's no guarantee that it will be read. You really, really want to be using rnf for this job, instead of turning your brain into a pretzel shape.

You might have to force the last value of the alist that the map gets flattened into, since otherwise there's no guarantee that it will be read.
You really, really want to be using rnf for this job, instead of turning your brain into a pretzel shape.
*nod* that's my eventual goal but I'd like to make sure I understand what is going on here first and rule out any bugs before I go using some class I'm not that familiar with. I'm starting to wonder if this isn't an issue with Data.ByteString.Lazy.Char8.{read,write}File. I am now printing out the data entirely:

    d <- decode <$> B.readFile stateFile
    print d

and I still get the same error when I go to writeFile later. There should be no data items in any of my structures that the print statement does not force. Tim Newsham http://www.thenewsh.com/~newsham/

I'm starting to wonder if this isn't an issue with Data.ByteString.Lazy.Char8.{read,write}File.
This simple test case fails:

    module Main where

    import qualified Data.ByteString.Lazy.Char8 as B

    main = do
        print =<< B.readFile "xxx"
        B.writeFile "xxx" =<< B.readFile "test.hs"

If you replace B.readFile with readFile and B.writeFile with writeFile it works properly. ByteString bug? Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
I'm starting to wonder if this isn't an issue with Data.ByteString.Lazy.Char8.{read,write}File.
This simple test case fails:
module Main where

    import qualified Data.ByteString.Lazy.Char8 as B

    main = do
        print =<< B.readFile "xxx"
        B.writeFile "xxx" =<< B.readFile "test.hs"
If you replace B.readFile with readFile and B.writeFile with writeFile it works properly. ByteString bug?
Ah, that would be a bug in older ByteString implementations, that were a bit incautious about closing handles. This example works for me with bytestring-0.9.1.0. You're looking for a post-Dec 19, 2007 release, after the patch:

    Wed Dec 19 22:06:13 PST 2007  Don Stewart
      * For lazy IO operations, be sure to hClose the resource on EOF

-- Don

Ah, that would be a bug in older ByteString implementations, that were a bit incautious about closing handles. This example works for me with
bytestring-0.9.1.0
Yup, thank you Don and Duncan for pointing this out. I updated my bytestring library and the test case no longer fails. However, I'm still having problems and not sure why. I was able to distill the problem down to this:

    $ od -x 1word32.bin
    0000000 0500 2ca4

    $ runhaskell test6.hs
    loading...
    saving...
    test6.hs: 1word32.bin: openFile: resource busy (file is locked)

    $ cat test6.hs
    module Main where

    import Control.Applicative
    import Control.Parallel.Strategies (rnf, NFData, using)
    import Data.Binary
    import qualified Data.ByteString.Lazy.Char8 as B
    import Data.Word

    stateFile = "1word32.bin"

    loadState :: IO Word32
    loadState = decode <$> B.readFile stateFile

    saveState :: Word32 -> IO ()
    saveState db = B.writeFile stateFile $ encode db

    {- -- Works!
    loadState = B.readFile stateFile
    saveState = B.writeFile stateFile
    -}

    -- force x = print x >> return x
    force = return . (`using` rnf)

    main = do
        putStrLn "loading..."
        d <- force =<< loadState
        putStrLn "saving..."
        saveState d

I tried this both with "print" and "rnf" to the same effect. It looks like there still might be some situations where the file isn't being closed? Should the file be closed when the last byte is read (in this case it's definitely reading all four bytes) or when the first byte after that is read (in this case it probably doesn't attempt to read more than 4 bytes)?

Tim Newsham http://www.thenewsh.com/~newsham/

Should the file be closed when the last byte is read (in this case it's definitely reading all four bytes) or when the first byte after that is read (in this case it probably doesn't attempt to read more than 4 bytes)?
I'll answer my own question. Both Prelude.readFile and Data.ByteString.Lazy.Char8.readFile will keep the file open after reading the last byte and close it when trying to read further. Proof:

    module Main where

    import Control.Applicative
    -- import qualified Data.ByteString.Lazy.Char8 as B
    import Prelude as B

    stateFile = "1word32.bin"

    main = do
        x <- B.take 4 <$> B.readFile stateFile
        -- x <- B.take 5 <$> B.readFile stateFile
        print x
        B.writeFile stateFile x

This works for Prelude and ByteString when taking 5 (there are exactly 4 bytes in "1word32.bin") and fails when taking 4. I'm not sure that this behavior is so bad.. there might be some advantages... but it might be nice to have it close after the last byte is read...

However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway... Thoughts?

Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
Should the file be closed when the last byte is read (in this case it's definitely reading all four bytes) or when the first byte after that is read (in this case it probably doesn't attempt to read more than 4 bytes)?
I'll answer my own question. Both Prelude.readFile and Data.ByteString.Lazy.Char8.readFile will keep the file open after reading the last byte and close it when trying to read further. Proof:
module Main where

    import Control.Applicative
    -- import qualified Data.ByteString.Lazy.Char8 as B
    import Prelude as B

    stateFile = "1word32.bin"

    main = do
        x <- B.take 4 <$> B.readFile stateFile
        -- x <- B.take 5 <$> B.readFile stateFile
        print x
        B.writeFile stateFile x
This works for Prelude and ByteString when taking 5 (there are exactly 4 bytes in "1word32.bin") and fails when taking 4.
I'm not sure that this behavior is so bad.. there might be some advantages... but it might be nice to have it close after the last byte is read...
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
This is perhaps a use case for Data.Binary.Strict then. -- Don

On Wed, Aug 13, 2008 at 5:02 PM, Tim Newsham
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
Would the error be raised in 'decode' or in 'runGet'? On a project in progress I rely on 'runGet' to toss out padding bytes for me. -Antoine

Ok, surely at least everyone must agree that this is a bug:

    force :: Word8 -> IO Word8
    force x = print x >> return x
    -- force = return . (`using` rnf)

    main = do
        d <- force =<< decodeFile stateFile
        encodeFile stateFile d
      where stateFile = "1word32.bin"

    test8.hs: 1word32.bin: openBinaryFile: resource busy (file is locked)

the built-in Data.Binary.decodeFile function doesn't close its handle when it is done (same reason as my earlier examples).
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
I would have expected this to fix my problems:

    binEof :: Get ()
    binEof = do
        more <- not <$> isEmpty
        when more $ error "expected EOF"

    decodeFully :: Binary b => B.ByteString -> b
    decodeFully = runGet (get << binEof)
      where a << b = a >>= (\x -> b >> return x)

but even when using decodeFully, it still doesn't close the handle. Shouldn't Data.Binary.Get.isEmpty force a file handle close in the case that it returns True? Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
Ok, surely at least everyone must agree that this is a bug:
force :: Word8 -> IO Word8
    force x = print x >> return x
    -- force = return . (`using` rnf)

    main = do
        d <- force =<< decodeFile stateFile
        encodeFile stateFile d
      where stateFile = "1word32.bin"
test8.hs: 1word32.bin: openBinaryFile: resource busy (file is locked)
Remember that

    decodeFile f = liftM decode (L.readFile f)

and

    readFile :: FilePath -> IO ByteString
    readFile f = openBinaryFile f ReadMode >>= hGetContents

where hGetContents sits in a loop, reading chunks,

    loop = do
        c <- S.hGetNonBlocking h k
        if S.null c
            then do
                eof <- hIsEOF h
                if eof then hClose h >> return Empty
                       else hWaitForInput h (-1) >> loop
            else do
                cs <- lazyRead
                return (Chunk c cs)

while isEmpty is just,

    isEmpty :: Get Bool
    isEmpty = do
        S s ss _ <- get
        return (B.null s && L.null ss)

That is, it checks the parsed chunk; it doesn't demand any more reading be done.

So the only way you're going to get that Handle closed by readFile is to ensure you read till EOF is hit. After you decode, just keep asking for bytes till EOF, or close it yourself,

    decodeFile f = do
        h  <- openFile f ReadMode
        ss <- L.hGetContents h
        let e = decode ss
        rnf e `seq` hClose h

or some such, where you can confirm the decoding has taken place.

So am I understanding you correctly that you believe this is not a bug? That the Data.Binary.decodeFile function leaks a file descriptor and this is proper behavior? I still don't understand your explanation of how isEmpty can return True without having read to EOF. The ByteString continues to contain more data until an EOF is reached. Doesn't one of return (B.null s && L.null ss) force getContents to read until EOF? On Wed, 13 Aug 2008, Don Stewart wrote:
newsham:
Ok, surely at least everyone must agree that this is a bug:
force :: Word8 -> IO Word8
    force x = print x >> return x
    -- force = return . (`using` rnf)

    main = do
        d <- force =<< decodeFile stateFile
        encodeFile stateFile d
      where stateFile = "1word32.bin"
test8.hs: 1word32.bin: openBinaryFile: resource busy (file is locked)
Remember that

    decodeFile f = liftM decode (L.readFile f)

and

    readFile :: FilePath -> IO ByteString
    readFile f = openBinaryFile f ReadMode >>= hGetContents

where hGetContents sits in a loop, reading chunks,

    loop = do
        c <- S.hGetNonBlocking h k
        if S.null c
            then do
                eof <- hIsEOF h
                if eof then hClose h >> return Empty
                       else hWaitForInput h (-1) >> loop
            else do
                cs <- lazyRead
                return (Chunk c cs)

while isEmpty is just,

    isEmpty :: Get Bool
    isEmpty = do
        S s ss _ <- get
        return (B.null s && L.null ss)

That is, it checks the parsed chunk; it doesn't demand any more reading be done.

So the only way you're going to get that Handle closed by readFile is to ensure you read till EOF is hit. After you decode, just keep asking for bytes till EOF, or close it yourself,

    decodeFile f = do
        h  <- openFile f ReadMode
        ss <- L.hGetContents h
        let e = decode ss
        rnf e `seq` hClose h

or some such, where you can confirm the decoding has taken place.
Tim Newsham http://www.thenewsh.com/~newsham/

newsham:
So am I understanding you correctly that you believe this is not a bug? That the Data.Binary.decodeFile function leaks a file descriptor and this is proper behavior?
It's not a bug. It's lazy IO. If you want the Handle closed, demand all the input. isEmpty will do this for you, if you're at the end of the file already.
I still don't understand your explanation of how isEmpty can return True without having read to EOF. The ByteString continues to contain more data until an EOF is reached. Doesn't one of
return (B.null s && L.null ss)
isEmpty is perfectly fine. You're just not demanding its result. Consider,

    {-# LANGUAGE BangPatterns #-}

    import Data.Word
    import Data.Binary
    import Data.Binary.Get
    import qualified Data.ByteString.Lazy as L
    import System.IO

    main = do
        encodeFile stateFile (42 :: Word32)
        d <- strictDecodeFile stateFile :: IO Word32
        encodeFile stateFile d
        print d
      where
        stateFile = "1word32.bin"

    strictDecodeFile :: Binary a => FilePath -> IO a
    strictDecodeFile f = do
        ss <- L.readFile f
        return $! runGet (do v  <- get
                             !m <- isEmpty -- if we're at the end, this will close it
                             return v) ss

Look at strictDecodeFile. It's pretty much identical to the normal decodeFile, but it assumes 'get' will consume the entire file. It then checks for null, which will trigger an EOF and close if you are actually at the end. So we just decode the file, and check if the buffer's empty at the end.

    $ ghc --make A.hs
    [1 of 1] Compiling Main             ( A.hs, A.o )
    Linking A ...
    $ ./A
    42

But if we leave out that bang pattern on isEmpty, the test won't run, and we'll get,

    $ ./A
    A: 1word32.bin: openBinaryFile: resource busy (file is locked)

So were you just confused about how to use isEmpty? You could also explicitly close in strictDecodeFile,

    strictDecodeFile :: Binary a => FilePath -> IO a
    strictDecodeFile f = do
        h  <- openFile f ReadMode
        ss <- L.hGetContents h
        let !v = runGet (do v <- get
                            return v) ss
        hClose h
        return v

Whatever works best for you. -- Don

On Wed, Aug 13, 2008 at 5:39 PM, Tim Newsham
So am I understanding you correctly that you believe this is not a bug? That the Data.Binary.decodeFile function leaks a file descriptor and this is proper behavior?
I think he might be saying that decodeFile is not the place for checking this condition. I will put words in his mouth and say that checking for EOF after a decode is the responsibility of the application code, because the lower level cannot possibly know whether it makes sense for there to be residual data in the ByteString. There are plenty of file formats that consist of back-to-back concatenated chunks of data, in which reading just one chunk does not by any means require that a file can only contain one.

bos:
On Wed, Aug 13, 2008 at 5:39 PM, Tim Newsham
wrote: So am I understanding you correctly that you believe this is not a bug? That the Data.Binary.decodeFile function leaks a file descriptor and this is proper behavior?
I think he might be saying that decodeFile is not the place for checking this condition. I will put words in his mouth and say that checking for EOF after a decode is the responsibility of the application code, because the lower level cannot possibly know whether it makes sense for there to be residual data in the ByteString. There are plenty of file formats that consist of back-to-back concatenated chunks of data, in which reading just one chunk does not by any means require that a file can only contain one.
Exactly. This particular condition -- that decode should consume exactly the amount of data in the input file, and be sitting on EOF at the end -- is application dependent. That said, there's an argument to be made that the wrapper, decodeFile, could reasonably assume this is the most common case. -- Don

I think he might be saying that decodeFile is not the place for checking this condition. I will put words in his mouth and say that checking for EOF after a decode is the responsibility of the application code, because the lower level cannot possibly know whether it makes sense for there to be residual data in the ByteString. There are plenty of file formats that consist of back-to-back concatenated chunks of data, in which reading just one chunk does not by any means require that a file can only contain one.
Right, but because of the way decodeFile works, whenever you do not have a data type that explicitly checks for EOF in its Get definition, decodeFile will leak a file handle. There is no way to check that there is residual data, to access it, or to close the file handle. Since this is the normal state of affairs (are there any Get definitions in the current library which check for EOF when done?) I would suggest that this is an API bug.

I would suggest that "decodeFile" should check for EOF when done. A second wrapper function "decodePartialFile" could return the unconsumed data, perhaps, for situations when the EOF behavior is not desired, or return some other way for the file to be closed. Additionally, I would suggest that the Data.Binary library provide a combinator for consuming data fully (ie. checking for EOF). ie:

    fully :: Get a -> Get a
    fully a = do
        x <- a
        e <- isEmpty
        return $ case e of
            False -> error "expected EOF"
            True  -> x

    decodeFully = runGet $ fully get
    decodeFile fn = decodeFully <$> B.readFile fn

to make it easy for developers who do not use the decodeFile primitive to add EOF checking to their marshalling functions. As it currently stands, the most obvious application of the Data.Binary API leads to subtly confusing errors that may go unnoticed for a while. (This would be a fine point for the documentation to address to prevent others from falling into the same hole.)

I'm currently using definitions like these and (`using` rnf) and have a server that is able to repeatedly read and write the state file. Many thanks to Dons, Bryan, Duncan and everyone else who helped me out...

Tim Newsham http://www.thenewsh.com/~newsham/

Tim Newsham wrote: [snip]
I would have expected this to fix my problems:
binEof :: Get ()
    binEof = do
        more <- not <$> isEmpty
        when more $ error "expected EOF"

    decodeFully :: Binary b => B.ByteString -> b
    decodeFully = runGet (get << binEof)
      where a << b = a >>= (\x -> b >> return x)
Remember that the Get monad is lazy; the result of binEof is never used, so the action is not performed.

    decodeFully :: Binary b => B.ByteString -> b
    decodeFully = runGet (get << binEof)
      where a << b = a >>= (\x -> b >>= \y -> y `seq` return x)

works, for example, and also

      where a << b = a >>= (\x -> b >>= \y -> return (y `seq` x))

and

      where (<<) = liftM2 (flip seq)

HTH, Bertram

On Wed, 2008-08-13 at 12:02 -1000, Tim Newsham wrote:
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
Thoughts?
I think you're right. The Binary instances cannot and must not read more than they need to, so that gives us the behaviour that we read exactly the length of the file, but no more, and thus we never hit EOF, so we don't close the file. So yes, decode should force the tail so that it can indeed hit EOF. Duncan

duncan.coutts:
On Wed, 2008-08-13 at 12:02 -1000, Tim Newsham wrote:
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
Thoughts?
I think you're right. The Binary instances cannot and must not read more than they need to, so that gives us the behaviour that we read exactly the length of the file, but no more, and thus we never hit EOF, so we don't close the file. So yes, decode should force the tail so that it can indeed hit EOF.
Duncan, You're suggesting that decode and decodeFile should whnf the next cell? -- Don

On Thu, 2008-08-14 at 10:21 -0700, Don Stewart wrote:
I think you're right. The Binary instances cannot and must not read more than they need to, so that gives us the behaviour that we read exactly the length of the file, but no more, and thus we never hit EOF, so we don't close the file. So yes, decode should force the tail so that it can indeed hit EOF.
Duncan,
You're suggesting that decode and decodeFile should whnf the next cell?
At least decodeFile should, since it doesn't give you any other access to the file handle otherwise. Does decode return the tail? I don't remember. If not it should also whnf it. If it does then the user can choose (they might want to do something else with the trailing data). Duncan

duncan.coutts:
On Thu, 2008-08-14 at 10:21 -0700, Don Stewart wrote:
I think you're right. The Binary instances cannot and must not read more than they need to, so that gives us the behaviour that we read exactly the length of the file, but no more, and thus we never hit EOF, so we don't close the file. So yes, decode should force the tail so that it can indeed hit EOF.
Duncan,
You're suggesting that decode and decodeFile should whnf the next cell?
At least decodeFile should, since it doesn't give you any other access to the file handle otherwise.
Does decode return the tail? I don't remember. If not it should also whnf it. If it does then the user can choose (they might want to do something else with the trailing data).
I've pushed a decodeFile that does a whnf on the tail after decoding. If you're at the end of the file, that's sufficient to close the Handle. You'll also need bytestring >= 0.9.1.0 (note, not the one that comes by default with ghc 6.8.x) -- Don
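For reference, the idea looks something like this (a sketch of the approach, not necessarily the exact patch), using runGetState to get at the unconsumed tail:

    import Data.Binary (Binary, get)
    import Data.Binary.Get (runGetState)
    import qualified Data.ByteString.Lazy as L

    decodeFile :: Binary a => FilePath -> IO a
    decodeFile f = do
        s <- L.readFile f
        let (v, rest, _) = runGetState get s 0
        -- whnf on the tail: if decoding consumed the whole file,
        -- this hits EOF and readFile closes the handle
        rest `seq` return v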

bos:
On Mon, Aug 25, 2008 at 2:28 PM, Don Stewart
wrote: I've pushed a decodeFile that does a whnf on the tail after decoding.
Does this mean that there are now NFData instances for bytestrings? That would be handy.
No, since I can get whnf with `seq`. However, that does sound like a good idea (a patch to the parallel library? )
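Something like the following orphan instances, say (a sketch against the old Control.Parallel.Strategies NFData class, not code from either package):

    import Control.Parallel.Strategies (NFData(..))
    import qualified Data.ByteString as S
    import qualified Data.ByteString.Lazy as L

    -- a strict ByteString in whnf is already fully evaluated
    instance NFData S.ByteString where
        rnf bs = bs `seq` ()

    -- forcing the spine of the chunk list performs all the lazy IO
    instance NFData L.ByteString where
        rnf = rnf . L.toChunks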

On Tue, Aug 26, 2008 at 3:04 PM, Don Stewart
No, since I can get whnf with `seq`. However, that does sound like a good idea (a patch to the parallel library? )
I suspect that patching parallel doesn't scale. It doesn't have a maintainer, so it will be slow, and the package will end up dragging in everything under the sun if we centralise instances in there. I think that the instance belongs in bytestring instead. I know that this would make everything depend on parallel, but that doesn't seem as bad a problem.

On Tue, 2008-08-26 at 15:31 -0700, Bryan O'Sullivan wrote:
On Tue, Aug 26, 2008 at 3:04 PM, Don Stewart
wrote: No, since I can get whnf with `seq`. However, that does sound like a good idea (a patch to the parallel library? )
I suspect that patching parallel doesn't scale. It doesn't have a maintainer, so it will be slow, and the package will end up dragging in everything under the sun if we centralise instances in there. I think that the instance belongs in bytestring instead. I know that this would make everything depend on parallel, but that doesn't seem as bad a problem.
This is a general problem we have with packages and instances. Perhaps in this specific case it wouldn't cause many problems to make bytestring depend on parallel (though it means bytestring cannot be a boot lib and cannot be used to implement basic IO) but in general it can be a problem. I can't see any obvious solutions. We don't want lots of tiny packages that just depend on two other packages and define a instance. Duncan

Duncan Coutts wrote:
On Tue, 2008-08-26 at 15:31 -0700, Bryan O'Sullivan wrote:
On Tue, Aug 26, 2008 at 3:04 PM, Don Stewart
wrote: No, since I can get whnf with `seq`. However, that does sound like a good idea (a patch to the parallel library? )
I suspect that patching parallel doesn't scale. It doesn't have a maintainer, so it will be slow, and the package will end up dragging in everything under the sun if we centralise instances in there. I think that the instance belongs in bytestring instead. I know that this would make everything depend on parallel, but that doesn't seem as bad a problem.
This is a general problem we have with packages and instances. Perhaps in this specific case it wouldn't cause many problems to make bytestring depend on parallel (though it means bytestring cannot be a boot lib and cannot be used to implement basic IO) but in general it can be a problem. I can't see any obvious solutions. We don't want lots of tiny packages that just depend on two other packages and define a instance.
Just some raw ideas: What if we had a way to express 'optional dependencies' between packages in a cabal file. Something like 'if package x is installed (and satisfies given version constraints) then add module UseX'. One problem with this idea is that I might install the missing (optional) package afterwards, and then I still do not have the instance I would like (unless I re-build). Could cabal be instructed to re-build a package if an optional dependency becomes available (or if one gets removed)? Cheers Ben

On Thu, 2008-08-28 at 21:34 +0200, Ben Franksen wrote:
Just some raw ideas:
What if we had a way to express 'optional dependencies' between packages in a cabal file. Something like 'if package x is installed (and satisfies given version constraints) then add module UseX'.
One problem with this idea is that I might install the missing (optional) package afterwards, and then I still do not have the instance I would like (unless I re-build). Could cabal be instructed to re-build a package if an optional dependency becomes available (or if one gets removed)?
Right, requiring an order of installation is rather unfortunate. Always rebuilding things doesn't really fit well with many deployment models. Duncan

On Wed, Aug 13, 2008 at 3:02 PM, Tim Newsham
However, I think the real blame here should probably go to Data.Binary, which doesn't attempt to check that it has consumed all of its input after doing a "decode". If "decode" completes and there is unconsumed data, it should probably raise an error (it already raises errors for premature EOF). There's no reason for it not to, since it does not provide the unconsumed data to the caller when it's done, anyway...
You missed runGetState in Data.Binary.Get, which I added. It's definitely not an error in the abstract to have excess input after you're done decoding. In your specific application, it might be, but then you should write a combinator that checks for this state.
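For example (a sketch; the helper name is made up):

    import Data.Binary (Binary, get)
    import Data.Binary.Get (runGetState)
    import qualified Data.ByteString.Lazy as L

    -- decode one value and hand back the unconsumed input, e.g. to
    -- skip padding or to decode back-to-back records
    decodeWithRest :: Binary a => L.ByteString -> (a, L.ByteString)
    decodeWithRest s = let (v, rest, _offset) = runGetState get s 0
                       in (v, rest)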

bos:
On Tue, Aug 12, 2008 at 6:01 PM, Tim Newsham
wrote: (my keys are dates, which are Enum). This should look at every key in every inner map. Shouldn't that be sufficient to force the entire data set (or do I have to touch the fields in the data elements too?)
You might have to force the last value of the alist that the map gets flattened into, since otherwise there's no guarantee that it will be read.
You really, really want to be using rnf for this job, instead of turning your brain into a pretzel shape.
The Pretzel being one of the lesser-known lazy, cyclic, functional data structures. -- Don

Don Stewart
You really, really want to be using rnf for this job, instead of turning your brain into a pretzel shape.
The Pretzel being one of the lesser-known lazy, cyclic, functional data structures.
So "pretzel-brain" is actually a honorific, rather than derogative term? /me makes mental note. -k -- If I haven't seen further, it is by standing in the footprints of giants

ketil:
Don Stewart
writes: You really, really want to be using rnf for this job, instead of turning your brain into a pretzel shape.
The Pretzel being one of the lesser-known lazy, cyclic, functional data structures.
So "pretzel-brain" is actually a honorific, rather than derogative term? /me makes mental note.
Yes, bestowed upon those who've read Okasaki, and can tie-the-knot. -- Don

Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?
Lazy IO. So force the result to be evaluated, and then close the handle, or use strict bytestring reading.
There is no visible handle. It's all hidden in readFile. I will try forcing the data.
I tried to force the data with:

    loadState db = do
        d <- decode <$> B.readFile stateFile
        let force = sum $ M.elems $ M.size `fmap` d
        force `seq` atomically $ writeTVar db d

and I get the same error when trying to writeFile after doing a loadState.
-- Don
Tim Newsham http://www.thenewsh.com/~newsham/

On Tue, Aug 12, 2008 at 5:34 PM, Tim Newsham
I tried to force the data with:
loadState db = do
        d <- decode <$> B.readFile stateFile
        let force = sum $ M.elems $ M.size `fmap` d
        force `seq` atomically $ writeTVar db d
and I get the same error when trying to writeFile after doing a loadState.
What happens if you simply print the number of elements in the map? Forcing its spine is all you should need.

On Tue, 12 Aug 2008, Bryan O'Sullivan wrote:
On Tue, Aug 12, 2008 at 5:34 PM, Tim Newsham
wrote: I tried to force the data with:
loadState db = do
        d <- decode <$> B.readFile stateFile
        let force = sum $ M.elems $ M.size `fmap` d
        force `seq` atomically $ writeTVar db d
and I get the same error when trying to writeFile after doing a loadState.
What happens if you simply print the number of elements in the map? Forcing its spine is all you should need.
This is what the sum above does. It's a Map of Maps, so this maps M.size over the outer Map and sums the resulting elements. That should touch every map element, at least. Tim Newsham http://www.thenewsh.com/~newsham/

On Wed, Aug 13, 2008 at 1:18 AM, Don Stewart
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               replicateM n get
Of course I changed this as well. Now it is:

    instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
        put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
        get   = liftM Map.fromDistinctAscList get

You don't have to convert the map to a list just to compute its size. Map.size is an O(1) function.

kr.angelov:
On Wed, Aug 13, 2008 at 1:18 AM, Don Stewart
wrote:

    instance Binary a => Binary [a] where
        put l = put (length l) >> mapM_ put l
        get   = do n <- get :: Get Int
                   replicateM n get
Of course I changed this as well. Now it is:
instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
    put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
    get   = liftM Map.fromDistinctAscList get
You don't have to convert the map to a list just to compute its size. Map.size is an O(1) function.
If you have a more efficient instance Binary Map, please send a patch. Collaborate! -- Don

On Tue, Aug 12, 2008 at 5:13 PM, Tim Newsham
The data type I'm storing is a Map (of maps):
type DailyDb  = M.Map Date Daily
type InstrsDb = M.Map String DailyDb
What's going on here?
The default marshalling scheme that Binary uses for lists and maps (which are flattened to lists before writing out) is not streamable. Instead of writing out data in chunks, it computes the length of the list and writes that, followed by the elements. Presumably on the read side, a huge thunk is being built up before any actual Map creation starts.
I also noticed another issue while testing. If my program loads the data at startup by calling loadState then all later calls to saveState give an error:
Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded.
Your loading of state isn't being forced to complete, so the file handle is still open when you try to save to the same file. The H98 standard requires that file handles be locked for exclusive access during writes. To force the read to finish, use rnf. You can find a description of how to use it, and the typeclasses involved, here: http://book.realworldhaskell.org/beta/concurrent.html
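Concretely, that advice amounts to something like this sketch (the name decodeFileStrict is made up, and it assumes an NFData instance for the decoded type):

    import Control.Monad (liftM)
    import Control.Parallel.Strategies (NFData, rnf)
    import Data.Binary (Binary, decode)
    import qualified Data.ByteString.Lazy as L

    decodeFileStrict :: (Binary a, NFData a) => FilePath -> IO a
    decodeFileStrict f = do
        v <- liftM decode (L.readFile f)
        -- force the whole value before returning it
        rnf v `seq` return v

(Though, as the rest of the thread shows, if decoding doesn't consume the whole file the handle can still be left open.)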

bos:
On Tue, Aug 12, 2008 at 5:13 PM, Tim Newsham
wrote: The data type I'm storing is a Map (of maps):
type DailyDb  = M.Map Date Daily
type InstrsDb = M.Map String DailyDb
What's going on here?
The default marshalling scheme that Binary uses for lists and maps (which are flattened to lists before writing out) is not streamable. Instead of writing out data in chunks, it computes the length of the list and writes that, followed by the elements. Presumably on the read side, a huge thunk is being built up before any actual Map creation starts.
Maybe it makes sense to have the streamable list instance in Binary as well, with some examples? -- Don

Maybe it makes sense to have the streamable list instance in Binary as well, with some examples?
A flexible format that doesn't sacrifice too much space efficiency would be to encode in chunks of up to 255 elements:

    Chunk = {
        length :: Word8
        elems  :: [Elem]   -- 0..255 repetitions
    }
    Chunks = [Chunk]       -- terminated with the first 0 length Chunk

streamable, amortized cost is about 1/256th of the length, and the encoding is much more efficient than the current scheme for short lists (like most strings). Currently a string "foobar" is 8 bytes for length and 7 for the actual string.
-- Don
Tim Newsham http://www.thenewsh.com/~newsham/
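A sketch of that chunked scheme (the names putChunked/getChunked are made up; note that this naive Get side still assembles the whole list before returning, so it is not lazy by itself):

    import Control.Monad (replicateM)
    import Data.Binary (Binary(..))
    import Data.Binary.Get (Get)
    import Data.Binary.Put (Put)
    import Data.Word (Word8)

    -- at most 255 elements per chunk; a zero-length chunk ends the stream
    putChunked :: Binary a => [a] -> Put
    putChunked xs = case splitAt 255 xs of
        ([], _)   -> put (0 :: Word8)
        (c, rest) -> do put (fromIntegral (length c) :: Word8)
                        mapM_ put c
                        putChunked rest

    getChunked :: Binary a => Get [a]
    getChunked = do
        n <- get :: Get Word8
        if n == 0
            then return []
            else do c    <- replicateM (fromIntegral n) get
                    rest <- getChunked
                    return (c ++ rest)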

Maybe it makes sense to have the streamable list instance in Binary as well, with some examples?
Chunk = {
        length :: Word8
        elems  :: [Elem]   -- 0..255 repetitions
    }
    Chunks = [Chunk]       -- terminated with the first 0 length Chunk
I tried my hand at the encoding above: http://www.thenewsh.com/%7Enewsham/store/test10.hs it seems to work, although it doesn't seem to be very efficient. I'm getting very large memory growth when I was hoping it would be lazy and memory efficient... What's leaking?
-- Don
Tim Newsham http://www.thenewsh.com/~newsham/

Old threads never die:
Tim Newsham
Chunk = {
        length :: Word8
        elems  :: [Elem]   -- 0..255 repetitions
    }
    Chunks = [Chunk]       -- terminated with the first 0 length Chunk
I tried my hand at the encoding above:
http://www.thenewsh.com/%7Enewsham/store/test10.hs
it seems to work, although it doesn't seem to be very efficient. I'm getting very large memory growth when I was hoping it would be lazy and memory efficient... What's leaking?
Did you ever get to the bottom of this? I have a similar problem with Data.Binary that I don't know how to work around yet. It boils down to reading a large list. This exhibits the problem:

    newtype Foo = Foo [Word8]

    instance Binary Foo where
        get = do
            xs <- replicateM 10000000 get
            return (Foo xs)

Doing 'x <- decodeFile "/dev/zero"' and 'case x of Foo y -> take 10 y' blows the heap. I thought Data.Binary was lazy? My actual program looks something like this:

    instance Binary MyData where
        get = do
            header <- get
            body   <- replicateM (data_length header) $ do
                ....stuff to read a data item
            return (MyData header body)

This blows the stack as soon as I try to access anything, even if it's just the contents of the header. Why? My understanding of how Data.Binary works must be sorely lacking. Could some kind soul please disperse some enlightenment in my direction? -k -- If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde
Doing 'x <- decodeFile "/dev/zero"
Well, it turns out 'decodeFile' needs to -- or does, anyway -- check whether the file is empty. Replacing it with a combination of 'decode' and 'readFile' solved the problem. Thanks to Saizan and the other people hanging around on #haskell. -k -- If I haven't seen further, it is by standing in the footprints of giants
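That is, something along these lines (a hypothetical name for the lazy combination described above):

    import Data.Binary (Binary, decode)
    import qualified Data.ByteString.Lazy as L

    -- unlike decodeFile, this never looks past the decoded value,
    -- so it can consume a stream like /dev/zero lazily
    lazyDecodeFile :: Binary a => FilePath -> IO a
    lazyDecodeFile f = fmap decode (L.readFile f)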

On Tue, 2008-08-12 at 14:13 -1000, Tim Newsham wrote:
I also noticed another issue while testing. If my program loads the data at startup by calling loadState then all later calls to saveState give an error:
Log: savedState.bin: openFile: resource busy (file is locked)
You're not using an old version of bytestring are you, anything older than 0.9.0.4? We had a bug where the handle was not closed as soon as we got to the end of the stream, so even forcing the whole input didn't help. Duncan

I had the same problem (stack overflow). The solution was to change
the >>= operator in the Get monad. Currently it is:
m >>= k = Get (\s -> let (a, s') = unGet m s
in unGet (k a) s')
but I changed it to:
m >>= k = Get (\s -> case unGet m s of
(a, s') -> unGet (k a) s')
It seems that the bind operator is lazy and this caused the stack overflow.
I have also another problem. Every Int and Word is stored as 64-bit
value and this expands the output file a lot. I have a lot of integers
and most of them are < 128 but not all of them. I changed the
serialization so that the Int and Word are serialized in a variable
number of bytes. Without this change the binary serialization was even
worse than the textual serialization that we had before. The file was
almost full with zeros.
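For concreteness, one common variable-length scheme looks like this (a sketch with made-up names, not necessarily the encoding described above): 7 bits of payload per byte, with the high bit marking continuation:

    import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
    import Data.Binary.Get (Get, getWord8)
    import Data.Binary.Put (Put, putWord8)
    import Data.Word (Word64)

    -- values < 128 take a single byte
    putVarWord :: Word64 -> Put
    putVarWord w
        | w < 0x80  = putWord8 (fromIntegral w)
        | otherwise = do putWord8 (fromIntegral (w .&. 0x7f) .|. 0x80)
                         putVarWord (w `shiftR` 7)

    getVarWord :: Get Word64
    getVarWord = go 0 0
      where
        go shift acc = do
            b <- getWord8
            let acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
            if testBit b 7 then go (shift + 7) acc' else return acc'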
I just haven't time to prepare a patch and to send it for review but
if other people have the same problem I will do it.
Best Regards,
Krasimir
On Wed, Aug 13, 2008 at 1:13 AM, Tim Newsham
I have a program that read in and populated a large data structure and then saved it out with Data.Binary and Data.ByteString.Lazy.Char8:
saveState db = B.writeFile stateFile =<< encode <$> atomically (readTVar db)
when I go to read this in later I get a stack overflow:
loadState db = do
    d <- decode <$> B.readFile stateFile
    atomically $ writeTVar db d
Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it.
or from ghci:
d <- liftM decode (Data.ByteString.Lazy.Char8.readFile "savedState.bin") :: IO InstrsDb
fromList *** Exception: stack overflow
The data type I'm storing is a Map (of maps):
type DailyDb  = M.Map Date Daily
type InstrsDb = M.Map String DailyDb
What's going on here? Why is the system capable of building and saving the data but not of reading and unmarshalling it? What is the proper way to track down where the exception is happening? Any debugging tips?
I also noticed another issue while testing. If my program loads the data at startup by calling loadState then all later calls to saveState give an error:
Log: savedState.bin: openFile: resource busy (file is locked)
this does not occur if the program wasn't loaded. My best guess here is that B.readFile isn't completing and closing the file for some reason. Is there a good way to force this?
Tim Newsham http://www.thenewsh.com/~newsham/

kr.angelov:
I had the same problem (stack overflow). The solution was to change the >>= operator in the Get monad. Currently it is:
m >>= k = Get (\s -> let (a, s') = unGet m s
                         in unGet (k a) s')
but I changed it to:
m >>= k = Get (\s -> case unGet m s of
                             (a, s') -> unGet (k a) s')
It seems that the bind operator is lazy and this caused the stack overflow.
Hmm. That's interesting. I'm not sure that doesn't change other things we rely on though.
I have also another problem. Every Int and Word is stored as 64-bit value and this expands the output file a lot. I have a lot of integers and most of them are < 128 but not all of them. I changed the serialization so that the Int and Word are serialized in a variable number of bytes. Without this change the binary serialization was even worse than the textual serialization that we had before. The file was almost full with zeros.
The motivation for this is to use zlib compress / decompress. E.g.

    writeFile "f" . compress . encode $ foo
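Spelled out (assuming the zlib package's Codec.Compression.Zlib, which works on lazy ByteStrings):

    import Codec.Compression.Zlib (compress, decompress)
    import Data.Binary (Binary, decode, encode)
    import qualified Data.ByteString.Lazy as L

    saveCompressed :: Binary a => FilePath -> a -> IO ()
    saveCompressed f = L.writeFile f . compress . encode

    loadCompressed :: Binary a => FilePath -> IO a
    loadCompressed f = fmap (decode . decompress) (L.readFile f)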
I just haven't time to prepare a patch and to send it for review but if other people have the same problem I will do it.
Patches welcome. You shouldn't need to patch a library like this, it should be able to do what you need. -- Don
participants (10)

- Antoine Latter
- Ben Franksen
- Bertram Felgenhauer
- Bryan O'Sullivan
- Don Stewart
- Duncan Coutts
- Felipe Lessa
- Ketil Malde
- Krasimir Angelov
- Tim Newsham