
wren ng thornton wrote:
If you have many identical strings then you will save a lot by memoizing your strings into Integers, and then serializing that memo table and the integerized version of your data structure. The amount of savings decreases as the number of duplications decreases, though since you don't need the memo table itself you should be able to serialize it in a way that doesn't add much overhead.
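Concretely, a minimal sketch of this interning idea, assuming Data.Map; the names intern and internAll are hypothetical:

    import qualified Data.Map as Map
    import Data.Map (Map)

    -- Replace a string by an Int, extending the memo table on a miss.
    intern :: String -> Map String Int -> (Int, Map String Int)
    intern s table = case Map.lookup s table of
        Just i  -> (i, table)
        Nothing -> let i = Map.size table
                   in  (i, Map.insert s i table)

    -- Integerize a whole list of strings; the resulting [Int] and the
    -- memo table are what would then get serialized (e.g. with Data.Binary).
    internAll :: [String] -> ([Int], Map String Int)
    internAll = go Map.empty
      where
        go table []     = ([], table)
        go table (s:ss) = let (i,  table')  = intern s table
                              (is, table'') = go table' ss
                          in  (i : is, table'')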
I had problems with the size of the allocated heap space after serializing and loading data with the binary package. The reason was that binary does not support sharing of identical elements. I first considered a restricted solution for strings and certain other data types, but came up with a generic solution in the end (I did it just last weekend).

I put the Binary monad in a state transformer with maps for memoization:

    type PutShared = St.StateT (Map Object Int, Int) PutM ()
    type GetShared = St.StateT (IntMap Object) Bin.Get

In addition to the standard get and put methods:

    class (Typeable a, Ord a, Eq a) => BinaryShared a where
      put :: a -> PutShared
      get :: GetShared a

I added putShared and getShared functions with memoization:

    putShared :: (a -> PutShared) -> a -> PutShared
    getShared :: GetShared a -> GetShared a

For types where I don't want memoization I can either refer to the underlying Binary monad for primitive types, e.g.:

    instance BinaryShared Int where
      put = lift . Bin.put
      get = lift Bin.get

or stay in the BinaryShared monad for types whose components I may memoize, e.g.:

    instance (BinaryShared a, BinaryShared b) => BinaryShared (a, b) where
      put (a, b) = put a >> put b
      get        = liftM2 (,) get get

And for types for which I do want memoization, I wrap it with putShared and getShared, e.g.:

    instance BinaryShared a => BinaryShared [a] where
      put = putShared (\l -> lift (Bin.put (length l)) >> mapM_ put l)
      get = getShared (do n <- lift (Bin.get :: Bin.Get Int)
                          replicateM n get)

This saved about a third of the heap space in my application. I didn't measure time. Maybe it would be useful to have something like this in the binary package.

Jürgen

PS: And here is the dirty implementation, in case someone finds it useful:

    -- Aliases used below: St = Control.Monad.State; Bin = Data.Binary
    -- (plus Data.Binary.Put/Get); Map = Data.Map; IMap = Data.IntMap;
    -- L = Data.ByteString.Lazy. forceJust is a fromJust variant that
    -- takes an error message.

    putShared :: (a -> PutShared) -> a -> PutShared
    putShared fput v = do
        (dict, unique) <- St.get
        case ObjC v `Map.lookup` dict of
            Just i  -> lift (Bin.putWord8 0 >> putWord64be (fromIntegral i))
            Nothing -> do
                St.put (dict, unique + 1)
                lift (Bin.putWord8 1)
                lift (putWord64be (fromIntegral unique))
                fput v
                (dict2, unique2) <- St.get
                let newDict = Map.insert (ObjC v) unique dict2
                St.put (newDict, unique2)

    getShared :: GetShared a -> GetShared a
    getShared f = do
        dict <- St.get
        w    <- lift Bin.getWord8
        case w of
            0 -> do
                i <- lift (liftM fromIntegral getWord64be)
                case IMap.lookup i dict of
                    Just (ObjC obj) ->
                        return (forceJust (cast obj) "Shared>>getShared: Cast failed")
                    Nothing ->
                        error $ "Shared>>getShared: Can't find in Map " ++ show i
            1 -> do
                i     <- lift (liftM fromIntegral getWord64be)
                obj   <- f
                dict2 <- St.get
                St.put (IMap.insert i (ObjC obj) dict2)
                return obj
            _ -> error $ "Shared>>getShared: Encoding error"

    data Object = forall a. (Typeable a, Ord a, Eq a) => ObjC { unObj :: a }

    instance Eq Object where
        ObjC a == ObjC b =
            if typeOf a /= typeOf b
                then False
                else Just a == cast b
                -- can someone explain to me why this works?

    instance Ord Object where
        compare (ObjC a) (ObjC b) =
            if typeOf a /= typeOf b
                then compare ((unsafePerformIO . typeRepKey . typeOf) a)
                             ((unsafePerformIO . typeRepKey . typeOf) b)
                else compare (Just a) (cast b)

    encodeSer :: BinaryShared a => a -> L.ByteString
    encodeSer v = runPut (evalStateT (put v) (Map.empty, 0))

    decodeSer :: BinaryShared a => L.ByteString -> a
    decodeSer = runGet (evalStateT get IMap.empty)
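For completeness, a small usage sketch (hypothetical, not part of the implementation above): it assumes everything above sits in one module, together with a Char instance analogous to the Int one, so that Strings are handled by the memoizing list instance.

    -- Assumes an instance like:
    --   instance BinaryShared Char where
    --     put = lift . Bin.put
    --     get = lift Bin.get
    sharedRoundTrip :: IO ()
    sharedRoundTrip = do
        let v  = (replicate 1000 "some long repeated string", 42 :: Int)
            bs = encodeSer v
            v' = decodeSer bs :: ([String], Int)
        print (L.length bs)   -- the 999 repeats are written as small back-references
        print (v == v')       -- True if the round trip preserves the value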