Re: [Haskell-cafe] Very large data structures in the heap

On Tue, Feb 17, 2015 at 8:26 PM, Dr. Olaf Klinke <o.klinke@dkfz-heidelberg.de> wrote:
On Tue, 17 Feb 2015, Kim-Ee Yeoh wrote:
On Tue, Feb 17, 2015 at 5:05 PM, Dr. Olaf Klinke <o.klinke@dkfz-heidelberg.de> wrote: I'd like to separate serialization (IO) from traversal of the structure (pure), but how can one avoid the entire heap object from being built? Can laziness be exploited?
Something I didn't quite understand from your email is whether
1. the data structure is being constructed and you'd like it simultaneously serialized to disk
or
2. it's already serialized and you'd like to perform pure traversals on it.
Problem 1 is one of compute then write, problem 2 is one of read then compute.
Lazy I/O, as in unsafeInterleaveIO, excels at 2.
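[To illustrate Kim-Ee's point: unsafeInterleaveIO defers an IO action until its result is actually demanded. A minimal sketch, not from the thread, of lazily reading a file line by line, so that only the lines a consumer forces are ever fetched from disk:]

```haskell
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazily read all lines from a handle. Each cons cell of the result
-- is produced only when the consumer demands it, so an expression like
-- `take 3 <$> lazyLines h` reads just three lines from disk.
lazyLines :: Handle -> IO [String]
lazyLines h = unsafeInterleaveIO $ do
  eof <- hIsEOF h
  if eof
    then return []
    else do
      l  <- hGetLine h
      ls <- lazyLines h
      return (l : ls)
```

This is the same trick the lazy ByteString reader uses internally: the read is interleaved with pure consumption, and unconsumed suffixes never enter the heap.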
-- Kim-Ee
Both.
The data structure is a search tree, which is intended to be used multiple times, so 1. applies. I guess that the only way to circumvent the memory bottleneck would be to serialize the already-constructed part of the structure and then use that serialized part while constructing the rest, somehow without de-serializing it again.
To an extent, the virtual memory of a modern operating system solves the memory bottleneck. I'm curious what you were hoping for when thinking of exploiting laziness. If the search tree is truly huge, you'd probably resort to fancy (or fuzzy or both) algorithms to break up the work.
2. also applies. Queries to the search tree traverse different paths of it, but one can not rule out that multiple queries will eventually have traversed the entire tree. Thus lazily de-serializing it may not be enough unless the garbage collector disposes of parts of the tree after each query. Which it won't, if we say something like this:
    search :: FilePath -> IO (Query -> Result)
    search treeOnDisk = do
        searchtree <- decodeFile treeOnDisk
        return (\query -> searchFor query searchtree)
No, gc _will_ reclaim memory if decodeFile uses unsafeInterleaveIO. Take a look at the lazy bytestring package.
P.S. Your search function looks odd. Surely you mean something with the type signature: FilePath -> Query -> IO Result ? -- Kim-Ee
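[The signature Kim-Ee suggests could be sketched as below. This is an illustrative reading, not code from the thread: searchFor is passed in as a parameter since its definition is never shown, and decodeFile is the one from the binary package. Decoding inside IO on every call means the tree is reachable only for the duration of one query, so the garbage collector can reclaim it afterwards:]

```haskell
import Data.Binary (Binary, decodeFile)

-- Kim-Ee's suggested shape: FilePath -> Query -> IO Result.
-- The decoded tree is local to each call, so nothing pins it
-- in the heap between queries.
search :: Binary tree => (query -> tree -> result)
       -> FilePath -> query -> IO result
search searchFor treeOnDisk query = do
  searchtree <- decodeFile treeOnDisk
  return (searchFor query searchtree)
```

The trade-off, as the thread discusses, is that without a lazy decodeFile each query pays for a full de-serialization.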

Here is an update: I considered comments made by Kim-Ee and Joachim Breitner and decided to represent my search tree as an unboxed IOVector that stores indices in its elements. The code now looks very imperative, but it is a bit simpler than the pure tying-the-knot code that builds the cyclic structure. I tried several approaches for lookups:

(i) Reconstruct the search tree (lazily) from the vector each time a query is made. Only a relatively small part should actually be constructed and garbage-collected. This turned out to be not overwhelmingly fast.

(ii) Keep the search tree structure implicit. Instead, descend the tree by doing reads on the vector and following the embedded indices around. This was faster.

(iii) Combine both: after a pre-specified number of lookup/insert cycles, reconstruct part of the search tree in the heap and traverse it, keeping the same tree for subsequent lookups. When reaching a leaf in the heap, continue hopping around the IOVector.

Since the size of the vector is not known beforehand, inserts use Data.Vector.Unboxed.Mutable.grow from time to time. As I read here (https://github.com/haskell/vector/issues/36) and tested on my machine, growing the IOVector makes a copy of it, so after my IOVector has grown to considerable size, additional inserts slow down considerably. However, when constructing a tree of final size 1.2 million nodes, approach (iii) already outperforms my pure algorithm. A tree with 18 million nodes is no problem to construct. -- Olaf
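[The pointer-hopping lookup of approach (ii) could look something like the sketch below. The node layout is an assumption, since Olaf's encoding is not shown: here each node occupies three Int slots (key, left-child index, right-child index), with -1 meaning "no child". Descending the tree is then just a loop of vector reads:]

```haskell
import qualified Data.Vector.Unboxed.Mutable as M

-- Approach (ii): the tree stays implicit in the flat vector.
-- Node i lives at slots 3*i (key), 3*i+1 (left), 3*i+2 (right);
-- a child index of -1 means the child is absent. This three-slot
-- layout is an assumption for illustration.
member :: M.IOVector Int -> Int -> IO Bool
member vec key = go 0
  where
    go (-1) = return False          -- fell off the tree: not present
    go i = do
      k <- M.read vec (3 * i)
      if key == k
        then return True
        else do
          -- hop to the left or right child index stored in the vector
          child <- M.read vec (if key < k then 3 * i + 1 else 3 * i + 2)
          go child
```

No tree nodes are ever allocated in the heap; each query costs one vector read per level, which matches the thread's observation that (ii) beats lazily reconstructing heap nodes per query.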
participants (2)
-
Dr. Olaf Klinke
-
Kim-Ee Yeoh