
Scott Weeks
For a project that I'm working on I need to implement a B-Tree or similar disk based data structure for storing large amounts of data. I feel like I'm getting it and then I get hung up on disk IO operations and trying to figure out how they should fit into the mix.
Here is a first-cut specification of a B-Tree: data BTree k d = BTree Int [(k,[d])] [BTree k d] where k = key and d = stored data. Obviously this does not actually store the B-Tree on file, it just gives a flavour of the intended structure. In order to store the data on file, we need to introduce an indirection corresponding to a file pointer, on each of the child nodes. data BTree k d = BTree Int [(k,[d])] [FilePtr (BTree k d)] Then you need a serialisation routine to write any one node of the tree out and read it back in. And your tree-traversal routines (lookup, insert, etc) need to be embedded in the IO monad, so they can do the necessary file access. Attached is an implementation of B-Trees (by Colin Runciman), which uses nhc98's Binary class http://haskell.org/nhc98/libs/Binary.html for serialisation. The type BinHandle corresponds to a file handle, and Bin corresponds to a file pointer. In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data. However the insertion routine clearly must use I/O to extend the tree. I hope you can see the pattern of how this I/O works. Note also that the B-Tree structure is stored in BPages, separate from the data values hanging off the tree, which are in DBlocks. Hope this helps, Regards, Malcolm