Data.Binary stack overflow with Data.Sequence String

So recently I've been having issues with Data.Binary & Data.Sequence; I serialize a 'Seq String' You can see the file here: http://code.haskell.org/yi/Yi/IReader.hs The relevant function seems to be: -- | Read in database from 'dbLocation' and then parse it into an 'ArticleDB'. readDB :: YiM ArticleDB readDB = io $ (dbLocation >>= r) `catch` (\_ -> return empty) where r x = fmap (decode . BL.fromChunks . return) $ B.readFile x -- We read in with strict bytestrings to guarantee the file is closed, -- and then we convert it to the lazy bytestring data.binary expects. -- This is inefficient, but alas... My current serialized file is about 9.4M. I originally thought that the issue might be the recent upgrade in Yi to binary 0.5, but I unpulled patches back to past that, and the problem still manifested. Whenever yi tries to read the articles.db file, it stack overflows. It actually stack-overflowed on even smaller files, but I managed to bump the size upwards, it seems, by the strict-Bytestring trick. Unfortunately, my personal file has since passed whatever that limit was. I've read carefully the previous threads on Data.Binary and Data.Map stack-overflows, but none of them seem to help; hacking some $!s or seqs into readDB seems to make no difference, and Seq is supposed to be a strict datastructure already! Doing things in GHCi has been tedious, and hasn't enlightened me much: sometimes things overflow and sometimes they don't. It's all very frustrating and I'm seriously considering going back to using the original read/show code unless anyone knows how to fix this - that approach may be many times slower, but I know it will work. -- gwern

I have collected some of the backing code to "decode". This is all pasted below so we can look at it. I will not improperly guess at the cause of the problem, and be totally wrong. I observe Get is a lazy monad: Prelude Data.Binary Data.Binary.Get Data.Monoid> "World" == runGet ((return $! undefined) >> return "World") mempty True Prelude Data.Binary Data.Binary.Get Data.Monoid> 'W' == head (runGet ((return $! undefined) >>= \t -> return $! ('W':t)) mempty) True Prelude Data.Binary Data.Binary.Get Data.Monoid> "orld" == tail (runGet ((return $! undefined) >>= \h -> return $! (h:"orld")) mempty) True This may have implication for building the "String" from "Char". The get for "Char" uses "return $! char" but this is no good unless the Char is being forced, as the (return $! undefined) above shows. The instance Get [a] inherits the laziness of replicateM which is sequence. The instance (Seq.Seq e) does not force the "x :: String" value. And even if it did it would only force the leading (:) cons cell and not the characters themselves. The instance is strict in what passes for the spine of the Seq, not the contents, and certainly not the deep contents. You might try using "newtype" when deserializing ArticleDB and make a much stricter version of the code. All the relevant code (yi & binary & ghc):
type Article = String type ArticleDB = Seq Article
-- | Read in database from 'dbLocation' and then parse it into an 'ArticleDB'. readDB :: YiM ArticleDB readDB = io $ (dbLocation >>= r) `catch` (\_ -> return empty) where r x = fmap (decode . BL.fromChunks . return) $ B.readFile x -- We read in with strict bytestrings to guarantee the file is closed, -- and then we convert it to the lazy bytestring data.binary expects. -- This is inefficient, but alas...
decode :: Binary a => ByteString -> a decode = runGet get
instance (Binary e) => Binary (Seq.Seq e) where put s = put (Seq.length s) >> Fold.mapM_ put s get = do n <- get :: Get Int rep Seq.empty n get where rep xs 0 _ = return $! xs rep xs n g = xs `seq` n `seq` do x <- g rep (xs Seq.|> x) (n-1) g
instance Binary Int where put i = put (fromIntegral i :: Int64) get = liftM fromIntegral (get :: Get Int64)
instance Binary Int64 where put i = put (fromIntegral i :: Word64) get = liftM fromIntegral (get :: Get Word64)
instance Binary Word64 where put = putWord64be get = getWord64be
instance Binary a => Binary [a] where put l = put (length l) >> mapM_ put l get = do n <- get :: Get Int replicateM n get
-- Char is serialised as UTF-8 instance Binary Char where put a | c <= 0x7f = put (fromIntegral c :: Word8) | c <= 0x7ff = do put (0xc0 .|. y) put (0x80 .|. z) | c <= 0xffff = do put (0xe0 .|. x) put (0x80 .|. y) put (0x80 .|. z) | c <= 0x10ffff = do put (0xf0 .|. w) put (0x80 .|. x) put (0x80 .|. y) put (0x80 .|. z) | otherwise = error "Not a valid Unicode code point" where c = ord a z, y, x, w :: Word8 z = fromIntegral (c .&. 0x3f) y = fromIntegral (shiftR c 6 .&. 0x3f) x = fromIntegral (shiftR c 12 .&. 0x3f) w = fromIntegral (shiftR c 18 .&. 0x7)
get = do let getByte = liftM (fromIntegral :: Word8 -> Int) get shiftL6 = flip shiftL 6 :: Int -> Int w <- getByte r <- case () of _ | w < 0x80 -> return w | w < 0xe0 -> do x <- liftM (xor 0x80) getByte return (x .|. shiftL6 (xor 0xc0 w)) | w < 0xf0 -> do x <- liftM (xor 0x80) getByte y <- liftM (xor 0x80) getByte return (y .|. shiftL6 (x .|. shiftL6 (xor 0xe0 w))) | otherwise -> do x <- liftM (xor 0x80) getByte y <- liftM (xor 0x80) getByte z <- liftM (xor 0x80) getByte return (z .|. shiftL6 (y .|. shiftL6 (x .|. shiftL6 (xor 0xf0 w)))) return $! chr r
replicateM :: (Monad m) => Int -> m a -> m [a] replicateM n x = sequence (replicate n x)
sequence :: Monad m => [m a] -> m [a] {-# INLINE sequence #-} sequence ms = foldr k (return []) ms where k m m' = do { x <- m; xs <- m'; return (x:xs) }
-- Chris
participants (2)
-
ChrisK
-
Gwern Branwen