
Thanks Malcolm, that is a great explanation.
In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data.
That's perfect. The insertion functions clear up the IO part of the equation immensely and since I will need to be able to delete nodes and data it will allow me to implement search/delete myself and cement a bit of this in my head. Cheers, that was a great help. On 07/11/2005, at 11:12 PM, Malcolm Wallace wrote:
Scott Weeks
writes: For a project that I'm working on I need to implement a B-Tree or similar disk based data structure for storing large amounts of data. I feel like I'm getting it and then I get hung up on disk IO operations and trying to figure out how they should fit into the mix.
Here is a first-cut specification of a B-Tree:
data BTree k d = BTree Int [(k,[d])] [BTree k d]
where k = key and d = stored data. Obviously this does not actually store the B-Tree on file, it just gives a flavour of the intended structure.
In order to store the data on file, we need to introduce an indirection corresponding to a file pointer, on each of the child nodes.
data BTree k d = BTree Int [(k,[d])] [FilePtr (BTree k d)]
Then you need a serialisation routine to write any one node of the tree out and read it back in. And your tree-traversal routines (lookup, insert, etc) need to be embedded in the IO monad, so they can do the necessary file access.
Attached is an implementation of B-Trees (by Colin Runciman), which uses nhc98's Binary class http://haskell.org/nhc98/libs/Binary.html for serialisation. The type BinHandle corresponds to a file handle, and Bin corresponds to a file pointer. In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data. However the insertion routine clearly must use I/O to extend the tree. I hope you can see the pattern of how this I/O works.
Note also that the B-Tree structure is stored in BPages, separate from the data values hanging off the tree, which are in DBlocks.
Hope this helps, Regards, Malcolm
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe