Information on B-Tree IO implemenations

Hello all, (I'm resending this because I sent the first one without being a member of the list and I'm assuming it bounced) I'm a newbie and while I grasp some of the basics of FP I have yet to really understand the guts of Haskell. The type system is starting to clear up for me and so are Monads but I have a long way to go. For a project that I'm working on I need to implement a B-Tree or similar disk based data structure for storing large amounts of data. I feel like I'm getting it and then I get hung up on disk IO operations and trying to figure out how they should fit into the mix. I'm not necessarily looking for someone to point me at an implementation (although I wouldn't mind) but if someone could point me in the direction of some material that could shed a light on things I would be much obliged. I've read some Okasaki and found a promising paper called "Database Manipulation in Haskell 1.3" by Kevin Hammond and Phil Trinder but while they talk to the data structures themselves they don't quite get around to mentioning how they achieve the IO part of the equation. Thanks, Scotty Weeks

Scott Weeks
For a project that I'm working on I need to implement a B-Tree or similar disk based data structure for storing large amounts of data. I feel like I'm getting it and then I get hung up on disk IO operations and trying to figure out how they should fit into the mix.
Here is a first-cut specification of a B-Tree: data BTree k d = BTree Int [(k,[d])] [BTree k d] where k = key and d = stored data. Obviously this does not actually store the B-Tree on file, it just gives a flavour of the intended structure. In order to store the data on file, we need to introduce an indirection corresponding to a file pointer, on each of the child nodes. data BTree k d = BTree Int [(k,[d])] [FilePtr (BTree k d)] Then you need a serialisation routine to write any one node of the tree out and read it back in. And your tree-traversal routines (lookup, insert, etc) need to be embedded in the IO monad, so they can do the necessary file access. Attached is an implementation of B-Trees (by Colin Runciman), which uses nhc98's Binary class http://haskell.org/nhc98/libs/Binary.html for serialisation. The type BinHandle corresponds to a file handle, and Bin corresponds to a file pointer. In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data. However the insertion routine clearly must use I/O to extend the tree. I hope you can see the pattern of how this I/O works. Note also that the B-Tree structure is stored in BPages, separate from the data values hanging off the tree, which are in DBlocks. Hope this helps, Regards, Malcolm

Thanks Malcolm, that is a great explanation.
In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data.
That's perfect. The insertion functions clear up the IO part of the equation immensely and since I will need to be able to delete nodes and data it will allow me to implement search/delete myself and cement a bit of this in my head. Cheers, that was a great help. On 07/11/2005, at 11:12 PM, Malcolm Wallace wrote:
Scott Weeks
writes: For a project that I'm working on I need to implement a B-Tree or similar disk based data structure for storing large amounts of data. I feel like I'm getting it and then I get hung up on disk IO operations and trying to figure out how they should fit into the mix.
Here is a first-cut specification of a B-Tree:
data BTree k d = BTree Int [(k,[d])] [BTree k d]
where k = key and d = stored data. Obviously this does not actually store the B-Tree on file, it just gives a flavour of the intended structure.
In order to store the data on file, we need to introduce an indirection corresponding to a file pointer, on each of the child nodes.
data BTree k d = BTree Int [(k,[d])] [FilePtr (BTree k d)]
Then you need a serialisation routine to write any one node of the tree out and read it back in. And your tree-traversal routines (lookup, insert, etc) need to be embedded in the IO monad, so they can do the necessary file access.
Attached is an implementation of B-Trees (by Colin Runciman), which uses nhc98's Binary class http://haskell.org/nhc98/libs/Binary.html for serialisation. The type BinHandle corresponds to a file handle, and Bin corresponds to a file pointer. In this code, in fact we managed to keep the lookup function pure, because the Binary interface allows referentially-transparent I/O (getFAt), which may not be possible in other serialisation libraries. This facility however depends on an invariant that the B-Tree can only be extended - there is no operation to remove nodes or data. However the insertion routine clearly must use I/O to extend the tree. I hope you can see the pattern of how this I/O works.
Note also that the B-Tree structure is stored in BPages, separate from the data values hanging off the tree, which are in DBlocks.
Hope this helps, Regards, Malcolm
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (2)
-
Malcolm Wallace
-
Scott Weeks