Structural sharing in Haskell data structures?

So I've been reading a lot about a (relatively) new language called Clojure. One of its goals is to make concurrency easier via a built-in home-grown STM. Anyway, one of the ways it tries to do this is to have completely immutable data structures. Every time I read a tutorial about this in Clojure, it says "...yes, it sounds awful to think that your whole data structure gets copied every time you want to make a change to it, but it's sane because of a technique called structural sharing". Yet every time I hear immutability talked about in Haskell, what I hear is "...yes, it sounds awful to think that your whole data structure gets copied every time you want to make a change to it, but it's sane because of laziness...unless you need the whole data structure...". So I'm just curious, does GHC use structural sharing or something similar? Do other implementations? Does it matter?

wagner.andrew:
So I've been reading a lot about a (relatively) new language called Clojure. One of its goals is to make concurrency easier via a built-in home-grown STM. Anyway, one of the ways it tries to do this is to have completely immutable data structures. Every time I read a tutorial about this in Clojure, it says "...yes, it sounds awful to think that your whole data structure gets copied every time you want to make a change to it, but it's sane because of a technique called structural sharing". Yet every time I hear immutability talked about in Haskell, what I hear is "...yes, it sounds awful to think that your whole data structure gets copied every time you want to make a change to it, but it's sane because of laziness...unless you need the whole data structure...". So I'm just curious, does GHC use structural sharing or something similar? Do other implementations? Does it matter?
Purity allows our data structures to have a lot of sharing. This is separate to laziness. Laziness lets us build up interesting structures that have unusual sharing. Actually, what kind of persistent structures does Clojure have at this stage? I was under the impression they were reusing Java data structures. E.g. some of the nicer ones on Hackage are zippers, Patricia tries, and finger trees, which I can't imagine have been ported. -- Don

Purity allows our data structures to have a lot of sharing. This is separate to laziness.
Ah, so Haskell does do it. Interesting that it so rarely comes up, whereas it's frequently mentioned in Clojure.
Laziness lets us build up interesting structures that have unusual sharing.
Actually, what kind of persistent structures does Clojure have at this stage? I was under the impression they were reusing Java data structures. E.g. some of the nicer ones on Hackage are zippers, Patricia tries, and finger trees, which I can't imagine have been ported.
It has some built-in persistent data structures: lists, vectors (arrays), maps, and sets. It also has strong interoperability with Java, so that any existing Java library can easily be used in Clojure code. In some ways, that makes it a VERY mature language already.

wagner.andrew:
Purity allows our data structures to have a lot of sharing. This is separate to laziness.
Ah, so Haskell does do it. Interesting that it so rarely comes up, whereas it's frequently mentioned in Clojure.
I think it is just assumed, since that's been the case for 20 years or more now. Sharing is kind of exciting to the ex-Java people looking at Clojure, I guess, since it's a new idea. So they talk about it.
Laziness lets us build up interesting structures that have unusual sharing.
Actually, what kind of persistent structures does Clojure have at this stage? I was under the impression they were reusing Java data structures. E.g. some of the nicer ones on Hackage are zippers, Patricia tries, and finger trees, which I can't imagine have been ported.
It has some built-in persistent data structures: lists, vectors (arrays), maps, and sets. It also has strong interoperability with Java, so that any existing Java library can easily be used in Clojure code. In some ways, that makes it a VERY mature language already.
Certainly the JVM and its libraries are mature. Looks like yet another example of tech incubation in Haskell, then dispersal outwards to other langs. The more the better. -- Don

Don Stewart wrote:
wagner.andrew:
Purity allows our data structures to have a lot of sharing. This is separate to laziness.
Ah, so Haskell does do it. Interesting that it so rarely comes up, whereas it's frequently mentioned in Clojure.
I think it is just assumed, since that's been the case for 20 years or more now. Sharing is kind of exciting to the ex-Java people looking at Clojure, I guess, since it's a new idea. So they talk about it.
So far as I know, all immutable languages do it. Like Don, I think the reason it's emphasized so much in Clojure is because of trying to convert the Java crowd and introducing them to immutability. With mutable languages you have to clone _everything_ for fear of later changes affecting earlier copies. Because of this, folks used to mutable languages often erroneously suppose that immutable languages do the same thing. There's a lot of FUD in the mainstream about declarative/immutable/functional languages and their performance behavior.
The idea is simple and you can do it in mutable languages perfectly well (clone the changed part of the spine, point to old substructure) so long as you're careful not to do mutation, e.g. by marking everything "final". You don't see it as much in mutable languages because people get squeamish about the first step: clone the changed spine; they'd much rather mutate it. In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
-- Live well, ~wren
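To make "clone the changed spine, point to old substructure" concrete, here is a minimal sketch (not from the thread; the names are mine) of path copying in a plain, unbalanced binary search tree. An insert re-allocates only the nodes between the root and the insertion point, and every untouched subtree is shared between the old and new versions:

-- Minimal path-copying sketch: inserting into a persistent binary search
-- tree copies only the spine from the root down to the new leaf.
data Tree a = Leaf | Node (Tree a) a (Tree a)

insert :: Ord a => a -> Tree a -> Tree a
insert x Leaf = Node Leaf x Leaf
insert x t@(Node l y r)
  | x < y     = Node (insert x l) y r   -- fresh spine node; r shared as-is
  | x > y     = Node l y (insert x r)   -- fresh spine node; l shared as-is
  | otherwise = t                       -- already present: share the whole tree

-- Both trees stay valid after the insert; they differ only in O(depth)
-- freshly allocated nodes and share everything else.
example :: (Tree Int, Tree Int)
example =
  let t1 = foldr insert Leaf [1 .. 1000]
      t2 = insert 0 t1
  in (t1, t2)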

I wanted to clear up one misconception here... On May 13, 2009, at 12:19 AM, wren ng thornton wrote:
In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Not true! If you look at the internals of Clojure, you'll discover they're using trees with *very* wide fanout (e.g. fanout-64 leaf trees for lists). Why? Because it's so cheap to allocate and GC these structures! By using shallow-but-wide trees we reduce the cost of indexing and accessing list elements. I suspect you'd still be hard-pressed to support this kind of allocation behavior in any of the present Haskell implementations, and Haskell implementations of the same kinds of structures have fanout limited to 2-4 elements or so. Now, if only the Clojure structures supported efficient concatenation... -Jan-Willem Maessen
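For readers who have not seen the shape being described, here is a hedged sketch (mine, not Clojure's actual implementation or any library's) of a shallow-but-wide persistent vector in Haskell, using Data.Vector from the vector package for the per-node arrays. With a fanout of 64, indexing a million elements touches only about four nodes, and an update copies just those few nodes while sharing the rest:

import qualified Data.Vector as V

-- Shallow-but-wide persistent vector sketch. A Branch records how many
-- elements each full child covers, so indexing is one division and one
-- array lookup per level.
data PVec a
  = Leaf (V.Vector a)                -- up to 64 elements
  | Branch Int (V.Vector (PVec a))   -- per-child capacity, up to 64 children

index :: PVec a -> Int -> a
index (Leaf xs)        i = xs V.! i
index (Branch step cs) i = index (cs V.! (i `div` step)) (i `mod` step)

-- Updating one element copies only the nodes on the path to it and shares
-- the rest: the same structural sharing as before, with far fewer levels.
update :: Int -> a -> PVec a -> PVec a
update i x (Leaf xs)        = Leaf (xs V.// [(i, x)])
update i x (Branch step cs) =
  let j = i `div` step
  in Branch step (cs V.// [(j, update (i `mod` step) x (cs V.! j))])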

Jan-Willem Maessen wrote:
I wanted to clear up one misconception here...
wren ng thornton wrote:
In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Not true!
I was speaking of Java, not Clojure. I believe the costs in Java are well documented, though I don't know enough about the JVM to know where the blame belongs. (All I know of Clojure is that it's a Lisp-like on the JVM :)
If you look at the internals of Clojure, you'll discover they're using trees with *very* wide fanout (eg fanout-64 leaf trees for lists). Why? Because it's so cheap to allocate and GC these structures! By using shallow-but-wide trees we reduce the cost of indexing and accessing list elements. I suspect you'd still be hard-pressed to support this kind of allocation behavior in any of the present Haskell implementations, and Haskell implementations of the same kinds of structures have limited fanout to 2-4 elements or so.
I was under the impression that the reason data structures in Haskell tend to be limited to 4-fanout had more to do with the cleanliness of the implementations--- pattern matching on 64-wide cells is quite ugly, as is dealing with the proliferation of corner cases for complex structures like finger trees, Patricia trees, etc. The use of view patterns could clean this up significantly. On the other hand, we do have things like lazy ByteStrings and UVector which do have wide fanouts. -- Live well, ~wren
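As a hedged illustration of the view-pattern point (the type and function names here are hypothetical, not from any library), an array-backed node needs only one constructor, and a view function restores head/tail-style matching without 64 positional fields:

{-# LANGUAGE ViewPatterns #-}
import qualified Data.Vector as V

-- One constructor backed by an array, instead of one constructor per width.
newtype WideNode a = WideNode (V.Vector a)

-- The view: expose a wide node as Maybe (first element, remainder).
uncons :: WideNode a -> Maybe (a, WideNode a)
uncons (WideNode v)
  | V.null v  = Nothing
  | otherwise = Just (V.head v, WideNode (V.tail v))

-- Callers pattern match through the view rather than on 64 fields.
firstOrDefault :: a -> WideNode a -> a
firstOrDefault d (uncons -> Nothing)     = d
firstOrDefault _ (uncons -> Just (x, _)) = x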

wren:
Jan-Willem Maessen wrote:
I wanted to clear up one misconception here...
wren ng thornton wrote:
In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Not true!
I was speaking of Java, not Clojure. I believe the costs in Java are well documented, though I don't know enough about the JVM to know where the blame belongs. (All I know of Clojure is that it's a Lisp-like on the JVM :)
If you look at the internals of Clojure, you'll discover they're using trees with *very* wide fanout (eg fanout-64 leaf trees for lists). Why? Because it's so cheap to allocate and GC these structures! By using shallow-but-wide trees we reduce the cost of indexing and accessing list elements. I suspect you'd still be hard-pressed to support this kind of allocation behavior in any of the present Haskell implementations, and Haskell implementations of the same kinds of structures have limited fanout to 2-4 elements or so.
I was under the impression that the reason datastructures in Haskell tend to be limited to 4-fanout had more to do with the cleanliness of the implementations--- pattern matching on 64-wide cells is quite ugly, as is dealing with the proliferation of corner cases for complex structures like finger trees, patricia trees, etc. The use of view patterns could clean this up significantly. On the other hand, we do have things like lazy ByteStrings and UVector which do have wide fanouts.
Yes, agreed. This is the first I've heard of a penalty for fanout. More details please! -- Don

Wren,
It's not true for Java either.
The idea of "cheap throw-away objects" isn't foreign for modern JVMs.
GC favors short-lived immutable objects since:
- all collectors are generational;
- GC cycle duration is proportional to the size of live objects, not
the heap size;
- object allocation is _very_ cheap, even in parallel;
Moreover, many advanced techniques are used in JVM GCs:
- there are parallel & concurrent collectors;
- adaptive ergonomics (e.g. heap resizing, according to some goals
on collection duration);
There are problems with functional languages on top of JVM, but GC
isn't among them.
If you are interested in current work, you may look at mlvm project [1].
-- vi
[1] http://openjdk.java.net/projects/mlvm/
On Thu, May 14, 2009 at 2:58 AM, wren ng thornton wrote:
Jan-Willem Maessen wrote:
I wanted to clear up one misconception here...
wren ng thornton wrote:
In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Not true!
I was speaking of Java, not Clojure. I believe the costs in Java are well documented, though I don't know enough about the JVM to know where the blame belongs. (All I know of Clojure is that it's a Lisp-like on the JVM :)
If you look at the internals of Clojure, you'll discover they're using trees with *very* wide fanout (eg fanout-64 leaf trees for lists). Why? Because it's so cheap to allocate and GC these structures! By using shallow-but-wide trees we reduce the cost of indexing and accessing list elements. I suspect you'd still be hard-pressed to support this kind of allocation behavior in any of the present Haskell implementations, and Haskell implementations of the same kinds of structures have limited fanout to 2-4 elements or so.
I was under the impression that the reason datastructures in Haskell tend to be limited to 4-fanout had more to do with the cleanliness of the implementations--- pattern matching on 64-wide cells is quite ugly, as is dealing with the proliferation of corner cases for complex structures like finger trees, patricia trees, etc. The use of view patterns could clean this up significantly. On the other hand, we do have things like lazy ByteStrings and UVector which do have wide fanouts.

On May 13, 2009, at 6:58 PM, wren ng thornton wrote:
Jan-Willem Maessen wrote:
I wanted to clear up one misconception here...
wren ng thornton wrote:
In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Not true!
I was speaking of Java, not Clojure. I believe the costs in Java are well documented, though I don't know enough about the JVM to know where the blame belongs. (All I know of Clojure is that it's a Lisp-like on the JVM :)
I think you're missing the point here: the code I refer to below *is in Java* and is running on a standard JVM; the "costs" you refer to simply don't exist! As Vladimir Ivanov points out, and as Rich Hickey is happy to observe in his talks on Clojure, the JVM handles allocation-intensive garbage-intensive programs very well.
If you look at the internals of Clojure, you'll discover they're using trees with *very* wide fanout (eg fanout-64 leaf trees for lists). Why? Because it's so cheap to allocate and GC these structures! By using shallow-but-wide trees we reduce the cost of indexing and accessing list elements. I suspect you'd still be hard-pressed to support this kind of allocation behavior in any of the present Haskell implementations, and Haskell implementations of the same kinds of structures have limited fanout to 2-4 elements or so.
I was under the impression that the reason datastructures in Haskell tend to be limited to 4-fanout had more to do with the cleanliness of the implementations--- pattern matching on 64-wide cells is quite ugly, as is dealing with the proliferation of corner cases for complex structures like finger trees, patricia trees, etc. The use of view patterns could clean this up significantly. On the other hand, we do have things like lazy ByteStrings and UVector which do have wide fanouts.
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Can you, oh Haskellers, implement a fast, wide-fanout (say >= 8) tree-based sequence implementation in Haskell, which supports at-least-log-time indexing and at-least-log-time cons with a large base for the logarithm? Can you do it without turning off array bounds checking (either by using unsafe operations or low-level peeking and poking) and without using an algebraic data type with O(f) constructors for fanout of f? You can turn off bounds checks if your program encodes static guarantees that indices cannot be out of bounds (there are a couple of libraries to do this). The spirit here is "Work in Haskell with safe operations and no FFI except through safe libraries, but otherwise use any extensions you like."
I actually think this *is* doable, but it touches a few areas where Haskell doesn't presently do well (the bounds checking in particular is a challenge). I threw in the bounds checking when I realized that in fact the equivalent Java code is always bounds checked, and these bounds checks are then optimized away where possible. Actually, I'd *love* to see an *in*efficient solution to eliminating as many bounds checks as possible!
-Jan
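To pin the challenge down, here is one way its interface might be written out (the class and its member names are hypothetical, invented for illustration; they are not from the thread or any library):

-- Hypothetical interface for the challenge: a persistent sequence with
-- logarithmic, large-base cons and indexing, built internally from safe,
-- bounds-checked array operations only.
class PersistentSeq f where
  empty :: f a
  cons  :: a -> f a -> f a        -- at least log-time, with a large base
  index :: f a -> Int -> Maybe a  -- at least log-time, bounds checked
  size  :: f a -> Int

A solution would then be an instance whose internal nodes have wide fanout and whose array accesses remain bounds checked.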

On Thu, 2009-05-14 at 09:03 -0400, Jan-Willem Maessen wrote:
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Can you, oh Haskellers, implement a fast, wide-fanout (say >= 8) tree-based sequence implementation in Haskell, which supports at-least-log-time indexing and at-least-log-time cons with a large base for the logarithm? Can you do it without turning off array bounds checking (either by using unsafe operations or low-level peeking and poking) and without using an algebraic data type with O(f) constructors for fanout of f? You can turn off bounds checks if your program encodes static guarantees that indices cannot be out of bounds (there are a couple of libraries to do this).
Can we motivate the restriction of not using multiple constructors? If we're only talking about a fanout of 8 then it doesn't look like a problem. It sounds like you're really asking for an array but without wanting to say so explicitly. Perhaps you should ask for a variable fanout or a fanout of something bigger like 32 (and presumably these requirements could be justified too?). Duncan

On May 14, 2009, at 10:17 AM, Duncan Coutts wrote:
On Thu, 2009-05-14 at 09:03 -0400, Jan-Willem Maessen wrote:
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Can you, oh Haskellers, implement a fast, wide-fanout (say >= 8) tree-based sequence implementation in Haskell, which supports at-least-log-time indexing and at-least-log-time cons with a large base for the logarithm? Can you do it without turning off array bounds checking (either by using unsafe operations or low-level peeking and poking) and without using an algebraic data type with O(f) constructors for fanout of f? You can turn off bounds checks if your program encodes static guarantees that indices cannot be out of bounds (there are a couple of libraries to do this).
Can we motivate the restriction of not using multiple constructors? If we're only talking about a fanout of 8 then it doesn't look like a problem.
I actually expect this will cause some fairly nasty code bloat, but I'm happy to be proven wrong. :-)
It sounds like you're really asking for an array but without wanting to say so explicitly. Perhaps you should ask for a variable fanout or a fanout of something bigger like 32 (and presumably these requirements could be justified too?).
Wide fanout seems fair. -Jan

On Thursday 14 May 2009 9:03:30 am Jan-Willem Maessen wrote:
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Data.Sequence has O(log n) index, concatenation, update, take, drop and splitAt, and O(1) cons, snoc, and viewing at both ends, according to the documentation. -- Dan
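For readers who have not used it, here is a small usage sketch exercising the operations Dan lists; the functions are Data.Sequence's own (from the containers package), while the example itself is mine:

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (<|), (|>), (><))

-- O(1) cons and snoc, O(log n) splitting, concatenation, indexing and update.
demo :: (Int, Seq Int)
demo =
  let s             = Seq.fromList [1 .. 10]
      s'            = (0 <| s) |> 11     -- cons at the front, snoc at the back
      (front, back) = Seq.splitAt 6 s'   -- split in O(log n)
      both          = front >< back      -- concatenate in O(log n)
  in (Seq.index both 5, Seq.update 5 42 both)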

On May 14, 2009, at 11:01 AM, Dan Doel wrote:
On Thursday 14 May 2009 9:03:30 am Jan-Willem Maessen wrote:
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Data.Sequence has O(log n) index, concatenation, update, take, drop and splitAt, and O(1) cons, snoc, and viewing at both ends, according to the documentation.
Yes. But large sequences end up being quite deep. Can a wide-fanout version be made that is actually faster? Note that the effective fanout of Hinze's finger trees is approximately e; consider effective fanouts of e^2 to e^4 (which may require substantially higher maximum fanout). -Jan
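As a back-of-the-envelope check (my arithmetic, not from the thread), the depth of a million-element sequence at those effective fanouts is roughly log_f n:

-- Depth estimates for n = 10^6 at effective fanouts e, e^2 and e^4.
depths :: [(Double, Double)]
depths = [ (f, logBase f 1e6) | f <- [exp 1, exp 2, exp 4] ]
-- evaluates to approximately [(2.7, 13.8), (7.4, 6.9), (54.6, 3.5)]

So moving the effective fanout from e to e^4 cuts the depth from about 14 levels to about 3-4.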

jmaessen:
On May 14, 2009, at 11:01 AM, Dan Doel wrote:
On Thursday 14 May 2009 9:03:30 am Jan-Willem Maessen wrote:
Hmm, I think neither of the data structures you name actually support both O(lg n) indexing and O(lg n) cons or append. That said, your point is well taken, so let's instead state it as a challenge:
Data.Sequence has O(log n) index, concatenation, update, take, drop and splitAt, and O(1) cons, snoc, and viewing at both ends, according to the documentation.
Yes. But large sequences end up being quite deep. Can a wide-fanout version be made that is actually faster? Note that the effective fanout of Hinze's finger trees is approximately e; consider effective fanouts of e^2 to e^4 (which may require substantially higher maximum fanout).
Ah, so I see what Jan's asking for (I think). During the hackathon a number of us were experimenting with associated types to flatten groups of nodes in Data.Map (by unpacking them into the constructor). We were playing with fanout of up to 16 (while preserving pattern matching that makes it look like the original constructors). Good speedups followed, thanks to the improved data density. Here's a video (first part is about visualizing heap structures): http://vimeo.com/4258084 -- Don
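A hedged sketch of the flattening idea (mine, not the hackathon code, and without the associated-types and pattern-matching machinery Don mentions): give the tree an extra, wider constructor so a parent and its two children collapse into one heap object, improving data density at the cost of more cases to match:

-- Simplified Data.Map-like shape (size fields omitted) with a flattened
-- constructor holding three keys and four children inline.
data Map k a
  = Tip
  | Bin  !k a !(Map k a) !(Map k a)
  | Bin3 !k a !k a !k a !(Map k a) !(Map k a) !(Map k a) !(Map k a)

lookup' :: Ord k => k -> Map k a -> Maybe a
lookup' _ Tip = Nothing
lookup' k (Bin kx x l r) = case compare k kx of
  LT -> lookup' k l
  EQ -> Just x
  GT -> lookup' k r
-- Invariant assumed: k1 < k2 < k3, and the children a, b, c, d cover the
-- four key ranges between and around them, in order.
lookup' k (Bin3 k1 x1 k2 x2 k3 x3 a b c d)
  | k <  k1   = lookup' k a
  | k == k1   = Just x1
  | k <  k2   = lookup' k b
  | k == k2   = Just x2
  | k <  k3   = lookup' k c
  | k == k3   = Just x3
  | otherwise = lookup' k d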

So far as I know, all immutable languages do it. Like Don, I think the reason it's emphasized so much in Clojure is because of trying to convert the Java crowd and introducing them to immutability. With mutable languages you have to clone _everything_ for fear of later changes affecting earlier copies. Because of this, folks used to mutable languages often erroneously suppose that immutable languages do the same thing. There's a lot of FUD in the mainstream about declarative/immutable/functional languages and their performance behavior.
The idea is simple and you can do it in mutable languages perfectly well (clone the changed part of the spine, point to old substructure) so long as you're careful not to do mutation, e.g. by marking everything "final". You don't see it as much in mutable languages because people get squeamish about the first step: clone the changed spine; they'd much rather mutate it. In heavily GCed languages like Haskell allocation and collection is cheap, so we don't mind too much; but in Java and the like, both allocation and collection are expensive so the idea of cheap throwaway objects is foreign.
Are you guys saying that the Clojure target audience is current Java programmers, whereas Haskell's is people who already do functional, immutable programming? Because how long the technique has been used, or how widely it's used, is irrelevant when trying to convince someone who doesn't have experience with immutable data structures that it's not completely insane.

Hi,
Andrew Wagner wrote:
So I'm just curious, does GHC use structural sharing or something similar?
Structural sharing is not a feature of implementations, but of libraries. Consider this example:

-- a function to "change" the head of a list
replaceHead y xs = y : tail xs

-- a big list
xs = [1..10000]

-- two new lists with changed heads
ys = replaceHead 42 xs
zs = replaceHead 27 xs

-- the length of our lists
n = length xs + length ys + length zs

In this example, n will be 30000, but even after evaluating xs, ys and zs, we have only 10002 cons cells allocated, because 9999 cons cells are shared between xs, ys and zs. This happens automatically in every language with references or pointers. However, it is only sane to do with immutable data structures, so programmers have to add extra code to explicitly avoid structural sharing in impure languages. Another example:

xs = 1 : xs

This list is infinite, but we have only one cons cell allocated.

Tillmann
participants (8):
- Andrew Wagner
- Dan Doel
- Don Stewart
- Duncan Coutts
- Jan-Willem Maessen
- Tillmann Rendel
- Vladimir Ivanov
- wren ng thornton