Very large data structures in the heap

I have a function that builds a (cyclic) data structure for searching text. The resulting object can be serialized and re-built. Real-world problems would, however, result in a tree with about 10^9 nodes. I take it that standard data type declarations are not very memory-efficient, so my (fully evaluated) search tree would not fit into memory as a heap object. Its serialised form should be of tractable size, e.g. by storing an array of labels and the tree structure as a bit array of parentheses. I'd like to separate serialization (IO) from traversal of the structure (pure), but how can one avoid building the entire heap object? Can laziness be exploited?

Olaf
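A minimal sketch of the kind of succinct serialisation described above, assuming a simple preorder encoding; the Tree type, the list of Bool for the shape, and the function names are illustrative choices, not the actual format (a real implementation would pack the shape into a bit array and the labels into an unboxed array):

    -- Illustrative only: preorder shape bits plus labels, stored separately.
    data Tree a = Bin (Tree a) (Tree a) | Leaf a

    -- Shape as bits: True marks an internal node, False marks a leaf.
    shape :: Tree a -> [Bool]
    shape (Leaf _)  = [False]
    shape (Bin l r) = True : shape l ++ shape r

    -- Labels in preorder, kept separate from the shape bits.
    labels :: Tree a -> [a]
    labels (Leaf a)  = [a]
    labels (Bin l r) = labels l ++ labels r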

On Tue, Feb 17, 2015 at 5:05 PM, Dr. Olaf Klinke <o.klinke@dkfz-heidelberg.de> wrote:
I'd like to separate serialization (IO) from traversal of the structure (pure), but how can one avoid building the entire heap object? Can laziness be exploited?
Something I didn't quite understand from your email is whether (1) the data structure is being constructed and you'd like it simultaneously serialized to disk, or (2) it's already serialized and you'd like to perform pure traversals on it. Problem 1 is one of compute then write; problem 2 is one of read then compute. Lazy I/O, as in unsafeInterleaveIO, excels at 2. -- Kim-Ee
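A rough sketch of case 2 with unsafeInterleaveIO, assuming the tree is already serialized to a file addressed by node offsets; readNodeAt and the offset-based layout are hypothetical stand-ins for the actual decoder:

    import System.IO
    import System.IO.Unsafe (unsafeInterleaveIO)

    data Tree a = Bin (Tree a) (Tree a) | Leaf a

    -- Hypothetical decoder: read one node at the given offset and return
    -- either a leaf label or the offsets of the two children.
    readNodeAt :: Handle -> Integer -> IO (Either String (Integer, Integer))
    readNodeAt _ _ = undefined  -- stub: depends on the on-disk format

    -- Build a tree whose subtrees are only read when they are demanded.
    lazyTree :: Handle -> Integer -> IO (Tree String)
    lazyTree h off = unsafeInterleaveIO $ do
      node <- readNodeAt h off
      case node of
        Left a       -> return (Leaf a)
        Right (l, r) -> do
          tl <- lazyTree h l
          tr <- lazyTree h r
          return (Bin tl tr)

Because every recursive call is wrapped in unsafeInterleaveIO, a subtree is only read from disk at the moment a pure traversal actually forces it, so the whole heap object is never built at once.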

Hello Olaf, on Tuesday, 17.02.2015 at 11:05 +0100, Dr. Olaf Klinke wrote:
I have a function that builds a (cyclic) data structure for searching text. The resulting object can be serialized and re-built. Real-world problems would, however, result in a tree with about 10^9 nodes. I take it that standard data type declarations are not very memory-efficient, so my (fully evaluated) search tree would not fit into memory as a heap object. Its serialised form should be of tractable size, e.g. by storing an array of labels and the tree structure as a bit array of parentheses.
you say cyclic, but you also say it is a tree. Do you mean that it locally looks like a tree, but is in fact a graph, due to self-references?
I'd like to separate serialization (IO) from traversal of the structure (pure), but how can one avoid building the entire heap object? Can laziness be exploited?
I would say "yes", but need more information for a more concrete answer :-)

Greetings from Karlsruhe,
Joachim

--
Joachim "nomeata" Breitner
mail@joachim-breitner.de • http://www.joachim-breitner.de/
Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F
Debian Developer: nomeata@debian.org

Dr. Olaf Klinke wrote:
I have a function that builds a (cyclic) data structure for searching text. The resulting object can be serialized and re-built. Real-world problems would, however, result in a tree with about 10^9 nodes. I take it that standard data type declarations are not very memory-efficient, so my (fully evaluated) search tree would not fit into memory as a heap object. Its serialised form should be of tractable size, e.g. by storing an array of labels and the tree structure as a bit array of parentheses.
I'd like to separate serialization (IO) from traversal of the structure (pure), but how can one avoid building the entire heap object? Can laziness be exploited?
As a rule of thumb, if you want tight control over memory usage, then you need to stick to an execution model that gives you that control -- so not lazy evaluation. It certainly is possible to reason about memory usage with lazy evaluation, but it's not straightforward.

Here's an idea that could work for you: the problem as you described it is that you want to use an algebraic data type

    data Tree a = Bin (Tree a) (Tree a) | Leaf a

but for reasons of space usage, you have to represent it as a compressed byte array, for instance like this:

    type Tree = ByteString

The drawback of the latter representation is that you lose pattern matching, but there is actually a way to sort of avoid that. In particular, consider the following function for the algebraic data type:

    caseof :: (Tree a -> Tree a -> b) -> (a -> b) -> (Tree a -> b)
    caseof f g tree = case tree of
        Bin x y -> f x y
        Leaf a  -> g a

This is essentially a case-of expression, written as a function with two arguments (the two cases). If you implement a function like this for your byte array `Tree`, then you can formulate your traversal in terms of this case-of expression/function, which handles all the decoding issues. Essentially, your `Tree` data type has to be a pointer to the tip of a subtree. You can also augment the `caseof` function to be in the IO monad if you want.

This way, you can avoid loading the tree into memory altogether. Again, the `caseof` function takes care of decoding the data format, while the traversal uses `caseof` to decide which parts of the tree to inspect.

This trick is also known as "Scott encoding", or "Mogensen-Scott encoding". It is very similar to the more well-known "Church encoding", but they differ in the case of recursive data types.

Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
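To make the `caseof` idea concrete for the byte-array representation, here is a minimal sketch with an invented layout (one tag byte, Word8 labels, and a one-byte length field for the left subtree); the layout and names are assumptions for illustration, and a real format would use proper variable-length size fields:

    import qualified Data.ByteString as BS
    import Data.Word (Word8)

    -- A 'Tree' is a ByteString whose first byte is the root of a subtree.
    type Tree = BS.ByteString

    -- Invented layout:
    --   leaf:     tag 0x00, then one label byte
    --   internal: tag 0x01, then one byte giving the length of the left
    --             subtree's encoding, then the left subtree, then the right.
    caseof :: (Tree -> Tree -> b) -> (Word8 -> b) -> Tree -> b
    caseof bin leaf t =
      case BS.head t of
        0 -> leaf (BS.index t 1)
        _ -> let len   = fromIntegral (BS.index t 1)
                 left  = BS.take len (BS.drop 2 t)
                 right = BS.drop (2 + len) t
             in  bin left right

    -- Traversals pattern-match via 'caseof' instead of constructors, e.g.:
    size :: Tree -> Int
    size = caseof (\l r -> size l + size r) (const 1)

A traversal such as `size` never builds the algebraic tree; it only slices the ByteString, and ByteString slicing is O(1), so the decoding work stays proportional to the parts of the tree the traversal actually inspects.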
participants (4)
- Dr. Olaf Klinke
- Heinrich Apfelmus
- Joachim Breitner
- Kim-Ee Yeoh