Re: [Haskell-cafe] Very large data structures in the heap

19 Feb 2015

      Dr. Olaf Klinke wrote:
...
I have a function that builds a (cyclic) data structure for searching 
text. The resulting object can be serialized and re-built. Real-world 
problems would, however, result in a tree with about 10^9 nodes. I take 
it that standard data type declarations are not very memory-efficient, 
so my (fully evaluated) search tree would not fit into memory as a heap 
object. Its serialised form should be of tractable size, e.g. by storing 
an array of labels and the tree structure as a bit array of parentheses.
I'd like to separate serialization (IO) from traversal of the stucture 
(pure), but how can one avoid the entire heap object from being built? 
Can lazyness be exploited?
As a rule of thumb, if you want tight control over memory usage, then 
you need to stick to an execution model that gives you that control -- 
so not lazy evaluation. It certainly is possible to reason about memory 
usage with lazy evaluation, but it's not straightforward.

Here's an idea that could work for you: The problem as you described it 
is that you want to use an algebraic data type

    data Tree a = Bin (Tree a) (Tree a) | Leaf a

but for reasons for space usage, you have to represent it as a 
compressed byte array, for instance like this

    type Tree = ByteString

The drawback of the latter representation is that you lose pattern 
matching, but there is actually a way to sort of avoid that. In 
particular, consider the following function for the algebraic data type:

    caseof :: (Tree a -> Tree a -> b) -> (a -> b) -> (Tree a -> b)
    caseof f g tree = case tree of
         Bin x y -> f x y
         Leaf a  -> g a

This is essentially a case-of expression, written as a function with two 
argument (the two cases). If you implement a function like this for your 
byte array `Tree`, then you can formulate your traversal in terms of 
this case-of expression/function, which handles all the decoding issues. 
Essentially, your `Tree` data type has to be a pointer to the tip of a 
subtree.

You can also augment the `caseof` function to be in the IO monad if you 
want. This way, you can avoid loading the tree in memory altogether. 
Again, the `caseof` function takes care of decoding the data format, 
while the traversal uses `caseof` to make decisions of which parts of 
the tree to inspect.

This trick is also known as "Scott encoding", or "Mogensen-Scott 
encoding". It is very similar to the more well-known "Church encoding", 
but they differ in the case of recursive data types.

Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com