
I would like to read a large FiniteMap off of disk faster than the time it takes to read the entire list of pairs. My solution is to save it as a bunch of smaller lists of pairs covering various key intervals (or recency intervals). I can then readFile all of these lists back into a bunch of FiniteMaps. Because readFile is lazy, my program can start very quickly without waiting to read the contents of all of these files.

To make this solution work, I need some data structure that looks like a FiniteMap but operates on a bunch of them, organized either in parallel (key-interval model) or in series (time-interval model). Is there some pre-existing data structure that does this, or do I need to roll my own? Or is there a better way to (de-)serialize FiniteMaps?

-Alex-

______________________________________________________________
S. Alexander Jacobson
tel:917-770-6565
http://alexjacobson.com
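[The "parallel" (key-interval) arrangement described above can be sketched directly. The following is only an illustrative sketch, not an existing library: the names Segment, SegmentedMap and lookupSeg are invented here, and Data.Map stands in for FiniteMap.]

    import qualified Data.Map as Map

    -- One segment covers keys in [segLo, segHi] and carries its map, which
    -- may still be an unevaluated thunk (e.g. built lazily from a file).
    data Segment k v = Segment { segLo :: k, segHi :: k, segMap :: Map.Map k v }

    -- A "parallel" collection of segments standing in for one big map.
    newtype SegmentedMap k v = SegmentedMap [Segment k v]

    -- Route a lookup to the first segment whose interval contains the key;
    -- only that segment's map is forced.
    lookupSeg :: Ord k => k -> SegmentedMap k v -> Maybe v
    lookupSeg k (SegmentedMap segs) =
      case [ segMap s | s <- segs, segLo s <= k, k <= segHi s ] of
        (m:_) -> Map.lookup k m
        []    -> Nothing

Because the lookup only inspects the interval bounds before committing to a segment, a segment whose map has not yet been read from disk stays unread until some key in its range is actually demanded.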

On Thu, 2004-12-09 at 11:04 -0500, S. Alexander Jacobson wrote:
Or is there a better way to (de-)serialize FiniteMaps?
There's a neat trick that we use in c2hs (at least the patched version shipped with gtk2hs) to strictly read all the keys of the finite map but read all the values lazily. This gives massive savings (from tens of seconds to tenths of a second) since for this application the keys are small (an int or two) and only a small number of items get looked up.

This uses GHC's Binary (de)serialisation module. The trick for lazy reading/writing is to store offsets, so that items can be skipped over and then sought back to later if a value is demanded. The actual details are already implemented in the Binary module as lazyPut/lazyGet.

So for FiniteMap, what I do is:

    instance (Binary key, Ord key, Binary elem) => Binary (FiniteMap key elem) where

      -- This would be the ordinary strict implementation:
      -- put_ bh fm = put_ bh (fmToList fm)
      -- get bh     = do list <- get bh
      --                 return (listToFM list)

      put_ bh fm = do
        let list = fmToList fm
        put_ bh (length list)
        mapM_ (\(key, val) -> do put_ bh key
                                 lazyPut bh val) list

      get bh = do
        len <- get bh
        let getMany :: Int -> IO [(key,elem)]
            getMany 0 = return []
            getMany n = do
              key <- get bh
              val <- lazyGet bh
              xs  <- getMany (n-1)
              return ((key,val):xs)
        list <- getMany len
        return (listToFM list)

So as you can see, on deserialisation we build a perfectly ordinary FiniteMap, but all the values in the map are thunks that will, on demand, read and deserialise the value from disk.

Actually, this is a rather nice example of using laziness. In a strict language, to do this lazy reading trick you would have to change every user of FiniteMap.

Of course all this relies on doing binary rather than textual serialisation; otherwise byte offsets into the file make no sense and you couldn't skip over stuff and come back to it later.

Duncan
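[As a rough illustration of the skip-and-seek-back idea: the following is only a sketch, not the code of GHC's Binary module. It uses a plain Handle, a textual length prefix in place of binary offsets, assumes a Unix-style single-byte encoding, and the names lazyPutSketch/lazyGetSketch are made up.]

    import System.IO
    import System.IO.Unsafe (unsafeInterleaveIO)
    import Control.Monad (replicateM)

    -- Write a value as a length prefix followed by its payload, so a
    -- reader can skip the payload without parsing it.
    lazyPutSketch :: Handle -> String -> IO ()
    lazyPutSketch h s = do
      hPutStrLn h (show (length s))
      hPutStr   h s

    -- Read the length prefix, skip the payload, and return a thunk that
    -- seeks back and reads the payload only if it is ever demanded.
    lazyGetSketch :: Handle -> IO String
    lazyGetSketch h = do
      n     <- fmap read (hGetLine h)
      start <- hTell h
      hSeek h RelativeSeek (fromIntegral n)   -- skip over the value for now
      unsafeInterleaveIO $ do                 -- runs only when forced
        here <- hTell h                       -- save the current position
        hSeek h AbsoluteSeek start
        payload <- replicateM n (hGetChar h)
        hSeek h AbsoluteSeek here             -- restore the position
        return payload

The real lazyPut/lazyGet work against the Binary module's own handles and byte offsets, but the shape is essentially the same: record where the value lives, skip past it, and wrap the deferred read in unsafeInterleaveIO so it only happens if the thunk is forced.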

Hmm, cool. Binary does not appear in the GHC docs... Is it new?

-Alex-
______________________________________________________________
S. Alexander Jacobson
tel:917-770-6565
http://alexjacobson.com

On Mon, 2004-12-13 at 14:56 -0500, S. Alexander Jacobson wrote:
Hmm cool. Binary does not appear in the GHC docs... Is it new?
Ah, it's not a user library distributed with GHC; it's a library used internally in the implementation of GHC. If you want to use it you have to rip it out of the ghc source tree. :-)

That said, it's actually quite a free-standing module. A version cleaned of most ghc dependencies (it still depends on FastMutInt) is available here:

http://cvs.sourceforge.net/viewcvs.py/*checkout*/gtk2hs/gtk2hs/tools/c2hs/ba...
http://cvs.sourceforge.net/viewcvs.py/*checkout*/gtk2hs/gtk2hs/tools/c2hs/ba...

One quick hack I made that you might need to fix is that I did:

    #define SIZEOF_HSINT 4

whereas of course SIZEOF_HSINT should come from /usr/lib/ghc-$VER/include/MachDeps.h or some config.h file generated by ./configure. Of course that is what the version in ghc's sources does.

Duncan
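[One way to avoid hard-coding the size is sketched below. This assumes the module is compiled with -cpp so that GHC's own include directory, where MachDeps.h lives, is on the preprocessor search path; it is an illustration, not necessarily how the gtk2hs copy does it.]

    {-# OPTIONS -cpp #-}
    module WordSize where

    #include "MachDeps.h"

    -- SIZEOF_HSINT is now supplied by GHC's MachDeps.h rather than
    -- hard-coded: 4 on 32-bit platforms, 8 on 64-bit platforms.
    sizeofHsInt :: Int
    sizeofHsInt = SIZEOF_HSINT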