
In Real World Haskell, the authors state that "the attraction of a tree to a functional programmer is cheap modification." They go on to explain that in a tree of, say, 10,000 nodes, approximately 9,985 would be reused (on average) when updating a single node, rather than having to copy all 10,000 nodes, as a Haskell list would have to do.

In my application, I have a collection of (only) about 200 objects. So on the surface, not large at all. However, each object in the collection is a large tree that might be up to 20 MB in size. The application pulls objects one at a time from the collection, performs some processing, and delivers a new, updated object that needs to be inserted into the collection for future processing (the original object is now obsolete). I'm guessing a Data.IntMap might involve copying about 8 objects for each single-object update, which, while better than copying all 200, is still not perfect.

What would be the most efficient Haskelly way to maintain my collection? I'm wondering also if there is some sort of "pointer list" I could use. I think recreating a list of 200 pointers (199 of which are unchanged) might be a pretty efficient solution, if it exists.

Sadly, I've tried to understand mutable arrays (of a few different flavors), but all the monadic stuff is over my head at the moment, and I couldn't find documentation or a tutorial clear enough for me to implement them.
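
For concreteness, here is a minimal sketch of the workflow described above, using Data.IntMap; the names (Tree, step, process) are purely illustrative and not from the original post:

    import qualified Data.IntMap as IM

    -- Stand-in for the real 20 MB per-object tree type; only a pointer to it
    -- is ever stored in the IntMap, so the map never copies the tree itself.
    data Tree = Leaf | Node Tree Int Tree

    -- Pull one object out, process it, and put the replacement back.
    -- Only the handful of IntMap nodes on the path to the key are rebuilt;
    -- the other 199 entries are shared with the old map.
    step :: Int -> (Tree -> Tree) -> IM.IntMap Tree -> IM.IntMap Tree
    step key process coll =
        case IM.lookup key coll of
          Nothing   -> coll
          Just tree -> IM.insert key (process tree) coll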

Excerpts from Travis Erdman's message of Tue Apr 06 15:31:30 -0400 2010:
In my application, I have a collection of (only) about 200 objects. So on the surface, not large at all. However, each object in the collection is a large tree that might be up to 20 MB in size. The application pulls objects one at a time from the collection, performs some processing, and delivers a new, updated object that needs to be inserted into the collection for future processing (the original object is now obsolete). I'm guessing a Data.IntMap might involve copying about 8 objects for each single-object update, which, while better than copying all 200, is still not perfect.
Haskell tree types are constructed so that the tree automatically stores pointers to the actual objects: the value only gets stored inside the node if you explicitly ask for it to be unboxed. So you're only paying for a few pointer copies, not a huge 20 MB copy. Have you profiled your application and found this to be the bottleneck?

Cheers,
Edward
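
To illustrate the boxed-vs-unboxed point, here is a tiny sketch (the type and field names are made up for this example): a plain field is held behind a pointer, while a strict, UNPACKed field is stored inline in the constructor.

    -- The lazy field `payload` is boxed: the constructor holds only a
    -- pointer to the (possibly huge) value, so copying the node copies
    -- one pointer, never the payload itself.
    data Wrapped a = Wrapped { payload :: a }

    -- A strict field with an UNPACK pragma is stored directly in the
    -- constructor; this is only worthwhile for small types like Int.
    data Counter = Counter {-# UNPACK #-} !Int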

On Tuesday, 06 April 2010 21:31:30, Travis Erdman wrote:
In Real World Haskell, the authors state that "the attraction of a tree to a functional programmer is cheap modification." They go on to explain that in a tree of, say, 10,000 nodes, approximately 9,985 would be reused (on average) when updating a single node, rather than having to copy all 10,000 nodes, as a Haskell list would have to do.
In my application, I have a collection of (only) about 200 objects. So on the surface, not large at all. However, each object in the collection is a large tree that might be up to 20 MB in size. The application pulls objects one at a time from the collection, performs some processing, and delivers a new, updated object that needs to be inserted into the collection for future processing (the original object is now obsolete). I'm guessing a Data.IntMap might involve copying about 8 objects for each single-object update, which, while better than copying all 200, is still not perfect.
What would be the most efficient Haskelly way to maintain my collection?
I'm wondering also if there is some sort of "pointer list" I could use. I think recreating a list of 200 pointers (199 of which are unchanged) might be a pretty efficient solution, if it exists.
Sadly, I've tried to understand Mutable Arrays (of a few different flavors), but all the monadic stuff is over my head at the moment, and I couldn't find sufficient documentation/tutorial that I could understand to implement them.
An immutable array should be fine for this. When you modify one tree, you create a new array of 200 pointers to trees; 199 of them are just copied from the old array, and one points to your new, modified tree. Of course, with STArrays there would be no need to copy any pointers or allocate new arrays, so in principle it should be even more efficient, but boxed mutable arrays and the garbage collector don't play too well together (http://hackage.haskell.org/trac/ghc/ticket/650), so it might not actually be better.
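
As a rough sketch of the immutable-array approach described here (the helper name is invented), the incremental-update operator (//) from Data.Array builds exactly such a new 200-slot array of pointers:

    import Data.Array

    -- Replace one slot of an immutable, boxed array.  (//) allocates a new
    -- array of 200 pointers; the 199 untouched slots still point at the
    -- same old trees, which are never copied.
    replaceAt :: Int -> e -> Array Int e -> Array Int e
    replaceAt i x arr = arr // [(i, x)]

    -- e.g.  collection' = replaceAt 17 newTree collection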

Am Dienstag 06 April 2010 21:51:47 schrieb Daniel Fischer:
An immutable array should be fine for this. When you modify one tree, you create a new array of 200 pointers to trees; 199 of them are just copied from the old array, and one points to your new, modified tree.
Of course, with STArrays there would be no need to copy any pointers or allocate new arrays, so in principle it should be even more efficient, but boxed mutable arrays and the garbage collector don't play too well together (http://hackage.haskell.org/trac/ghc/ticket/650), so it might not be better.
Forgot to check before posting: it's fixed now; the fix will be in 6.12.2. And as Edward said, only pointers will be copied anyway, so a Map may well be better than an array, depending on your access patterns.
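
As a hedged illustration of the "depends on your access patterns" trade-off (the helper names here are invented): a Map update rebuilds only the O(log n) nodes on the path to the key, while an immutable-array update rebuilds all n slots but offers O(1) indexed reads.

    import qualified Data.Map as M
    import Data.Array

    -- Map: an update touches ~log2 200 interior nodes; reads are O(log n).
    updateMap :: Ord k => k -> (v -> v) -> M.Map k v -> M.Map k v
    updateMap k f = M.adjust f k

    -- Array: an update rebuilds all 200 (pointer-sized) slots; reads are O(1).
    updateArray :: Int -> (v -> v) -> Array Int v -> Array Int v
    updateArray i f arr = arr // [(i, f (arr ! i))]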

On 06.04.10 21:31, Travis Erdman wrote:
In Real World Haskell, the authors state that "the attraction of a tree to a functional programmer is cheap modification." They go on to explain that in a tree of, say, 10,000 nodes, approximately 9,985 would be reused (on average) when updating a single node, rather than having to copy all 10,000 nodes, as a Haskell list would have to do.
In my application, I have a collection of (only) about 200 objects. So on the surface, not large at all. However, each object in the collection is a large tree that might be up to 20 MB in size. The application pulls objects one at a time from the collection, performs some processing, and delivers a new, updated object that needs to be inserted into the collection for future processing (the original object is now obsolete). I'm guessing a Data.IntMap might involve copying about 8 objects for each single-object update, which, while better than copying all 200, is still not perfect.
The same reasoning that applies to trees also applies to the objects. The only things that will be copied are about log(200) *pointers* to your objects (log2 200 ≈ 7.6, hence the "about 8" estimate); the 20 MB objects themselves will never be copied.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
participants (4)
- Daniel Fischer
- Edward Z. Yang
- Heinrich Apfelmus
- Travis Erdman