
Dr. Olaf Klinke wrote:
I have a function that builds a (cyclic) data structure for searching text. The resulting object can be serialized and re-built. Real-world problems would, however, result in a tree with about 10^9 nodes. I take it that standard data type declarations are not very memory-efficient, so my (fully evaluated) search tree would not fit into memory as a heap object. Its serialised form should be of tractable size, e.g. by storing an array of labels and the tree structure as a bit array of parentheses.
I'd like to separate serialization (IO) from traversal of the stucture (pure), but how can one avoid the entire heap object from being built? Can lazyness be exploited?
As a rule of thumb, if you want tight control over memory usage, then you need to stick to an execution model that gives you that control -- so not lazy evaluation. It certainly is possible to reason about memory usage with lazy evaluation, but it's not straightforward. Here's an idea that could work for you: The problem as you described it is that you want to use an algebraic data type data Tree a = Bin (Tree a) (Tree a) | Leaf a but for reasons for space usage, you have to represent it as a compressed byte array, for instance like this type Tree = ByteString The drawback of the latter representation is that you lose pattern matching, but there is actually a way to sort of avoid that. In particular, consider the following function for the algebraic data type: caseof :: (Tree a -> Tree a -> b) -> (a -> b) -> (Tree a -> b) caseof f g tree = case tree of Bin x y -> f x y Leaf a -> g a This is essentially a case-of expression, written as a function with two argument (the two cases). If you implement a function like this for your byte array `Tree`, then you can formulate your traversal in terms of this case-of expression/function, which handles all the decoding issues. Essentially, your `Tree` data type has to be a pointer to the tip of a subtree. You can also augment the `caseof` function to be in the IO monad if you want. This way, you can avoid loading the tree in memory altogether. Again, the `caseof` function takes care of decoding the data format, while the traversal uses `caseof` to make decisions of which parts of the tree to inspect. This trick is also known as "Scott encoding", or "Mogensen-Scott encoding". It is very similar to the more well-known "Church encoding", but they differ in the case of recursive data types. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com