
Hi, I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf I wanted to start a new thread in response to strictness issues brought up by Claus Reinke.
- IntMap and Map expose different APIs wrt strictness http://www.haskell.org/pipermail/libraries/2009-March/011399.html http://www.mail-archive.com/haskell-cafe@haskell.org/msg48925.html
- to work around the non-strict Map fold, users have to resort to nested folds (non-strict fold to convert to list, then strict foldl' over list) and other tricks: http://www.haskell.org/pipermail/haskell-cafe/2010-June/079098.html
similarly non-trivial thinking is required to work around other no-control-over-strictness issues in the APIs, bypassing the obvious functions and using non-obvious ones based on knowledge about their implementation
- strictness is used inconsistently: http://hackage.haskell.org/trac/ghc/ticket/4109
- strictness is not documented: !!! search for "strict" in the Data.Map/IntMap documentation..
this is really bad - the only way for users to become aware of strictness problems is by running into them, and the only way to fix strictness-related performance issues is by referring to the implementation (in theory, an implementation change could invalidate lots of performance fixes in production code)
I need some opinion: - Do you think methods like insert/lookup/delete/etc should be strict in key/element? As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container). The problem is the following: a lookup on an empty container does not really have to evaluate the key. Should we honour it or should we sacrifice it to be faster and consistent with other methods (eg., insert has to be strict in the key). - The Set/Map, IntSet/IntMap do not currently have a strict fold. Do you think it should be added to the API? Cheers, Milan

On Thu, Jun 24, 2010 at 12:14 PM, Milan Straka
Hi,
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I wanted to start a new thread in response to strictness issues brought up by Claus Reinke.
- IntMap and Map expose different APIs wrt strictness http://www.haskell.org/pipermail/libraries/2009-March/011399.html http://www.mail-archive.com/haskell-cafe@haskell.org/msg48925.html
- to work around the non-strict Map fold, users have to resort to nested folds (non-strict fold to convert to list, then strict foldl' over list) and other tricks: http://www.haskell.org/pipermail/haskell-cafe/2010-June/079098.html
similarly non-trivial thinking is required to work around other no-control-over-strictness issues in the APIs, bypassing the obvious functions and using non-obvious ones based on knowledge about their implementation
- strictness is used inconsistently: http://hackage.haskell.org/trac/ghc/ticket/4109
- strictness is not documented: !!! search for "strict" in the Data.Map/IntMap documentation..
this is really bad - the only way for users to become aware of strictness problems is by running into them, and the only way to fix strictness-related performance issues is by referring to the implementation (in theory, an implementation change could invalidate lots of performance fixes in production code)
I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict in key/element?
Yes. This allows us to use associated data types to unpack keys and values directly in the constructor saving both time (as accessing keys require fewer indirections) and space (as there's less overhead per key/value pair).
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
The problem is the following: a lookup on an empty container does not really have to evaluate the key. Should we honour it or should we sacrifice it to be faster and consistent with other methods (eg., insert has to be strict in the key).
- The Set/Map, IntSet/IntMap do not currently have a strict fold. Do you think it should be added to the API?
Yes. I believe the recent post on storing lots amount of Twitter data in a Map ran into problems due to lack of a strict fold. I think it's important to consider all the container types at once and find a consistent set of functions and strictness guarantees so that we don't introduce more inconsistencies in their APIs. Johan

On 24 June 2010 11:14, Milan Straka
I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict in key/element?
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways. So yes we should have some kind of consistent convention. We could do worse than the naming convention where the strict versions use a trailing prime ' character. Duncan

On 24 June 2010 11:14, Milan Straka
wrote: I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict in key/element?
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways.
So yes we should have some kind of consistent convention. We could do worse than the naming convention where the strict versions use a trailing prime ' character.
I thought we are talking only about keys/elements. I would leave the values untouched. Personally I vote for: - keys in Maps and elements in Sets are strict - vales in Maps are left untouched (lazy) Cheers, Milan

2010/6/24 Milan Straka
On 24 June 2010 11:14, Milan Straka
wrote: I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict in key/element?
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways.
So yes we should have some kind of consistent convention. We could do worse than the naming convention where the strict versions use a trailing prime ' character.
I thought we are talking only about keys/elements. I would leave the values untouched.
Personally I vote for: - keys in Maps and elements in Sets are strict - vales in Maps are left untouched (lazy)
+1 from me Great work so far. -Edward Kmett

On Thu, Jun 24, 2010 at 1:59 PM, Edward Kmett
2010/6/24 Milan Straka
On 24 June 2010 11:14, Milan Straka
wrote: I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict
in
key/element?
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways.
So yes we should have some kind of consistent convention. We could do worse than the naming convention where the strict versions use a trailing prime ' character.
I thought we are talking only about keys/elements. I would leave the values untouched.
Personally I vote for: - keys in Maps and elements in Sets are strict - vales in Maps are left untouched (lazy)
+1 from me
Great work so far.
The space overhead per key/value pair is 6 words (48 bytes on a 64-bit architecture) when using lazy values but only 4 words (32 bytes) per key/value pair when using strict (unpacked) values, a 50% difference. This really starts to matter with big enough data sets (as seen in the recent Twitter analysis thread). When work with Big Data it's often desirable to fit as much data in RAM as possible as the result of many algorithms (think machine learning or search ranking) differs with the amount of data you can hold in memory. Something to consider. Johan

On Thu, Jun 24, 2010 at 8:27 AM, Johan Tibell
On Thu, Jun 24, 2010 at 1:59 PM, Edward Kmett
wrote: 2010/6/24 Milan Straka
On 24 June 2010 11:14, Milan Straka
wrote: I need some opinion:
- Do you think methods like insert/lookup/delete/etc should be strict
in
key/element?
As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways.
So yes we should have some kind of consistent convention. We could do worse than the naming convention where the strict versions use a trailing prime ' character.
I thought we are talking only about keys/elements. I would leave the values untouched.
Personally I vote for: - keys in Maps and elements in Sets are strict - vales in Maps are left untouched (lazy)
+1 from me
Great work so far.
The space overhead per key/value pair is 6 words (48 bytes on a 64-bit architecture) when using lazy values but only 4 words (32 bytes) per key/value pair when using strict (unpacked) values, a 50% difference. This really starts to matter with big enough data sets (as seen in the recent Twitter analysis thread). When work with Big Data it's often desirable to fit as much data in RAM as possible as the result of many algorithms (think machine learning or search ranking) differs with the amount of data you can hold in memory.
Something to consider.
I definitely agree that unboxing can help a great deal with performance and space utilization. However, as containers does not currently require any exotic extensions, I think that perhaps a type family -based generic map would belong in another 'unboxed-containers' or 'adaptive-containers' package (both of which currently exist on hackage), as it dramatically extends the language extension footprint of containers, taking it from something that easily runs across a wide array of Haskell implementations to something very ghc-specific. -Edward Kmett

On Thu, Jun 24, 2010 at 3:28 PM, Edward Kmett
I definitely agree that unboxing can help a great deal with performance and space utilization.
However, as containers does not currently require any exotic extensions, I think that perhaps a type family -based generic map would belong in another 'unboxed-containers' or 'adaptive-containers' package (both of which currently exist on hackage), as it dramatically extends the language extension footprint of containers, taking it from something that easily runs across a wide array of Haskell implementations to something very ghc-specific.
Good point. I plan to look into such a library this summer. Any non-strictness related API improvements to the containers package (e.g. strict folds) should carry over to a new, adaptable containers package. -- Johan

On Thu, Jun 24, 2010 at 09:28:15AM -0400, Edward Kmett wrote:
On Thu, Jun 24, 2010 at 8:27 AM, Johan Tibell
wrote: The space overhead per key/value pair is 6 words (48 bytes on a 64-bit architecture) when using lazy values but only 4 words (32 bytes) per key/value pair when using strict (unpacked) values, a 50% difference. This really starts to matter with big enough data sets (as seen in the recent Twitter analysis thread). When work with Big Data it's often desirable to fit as much data in RAM as possible as the result of many algorithms (think machine learning or search ranking) differs with the amount of data you can hold in memory.
Something to consider.
I definitely agree that unboxing can help a great deal with performance and space utilization.
However, as containers does not currently require any exotic extensions, I think that perhaps a type family -based generic map would belong in another 'unboxed-containers' or 'adaptive-containers' package (both of which currently exist on hackage), as it dramatically extends the language extension footprint of containers, taking it from something that easily runs across a wide array of Haskell implementations to something very ghc-specific.
If Map was strict in keys and values, we could have data Unstrict a = U a to unstrictify the values. I don't know if this a good solution, though :). Cheers, -- Felipe.

As Claus wrote, right now it is undocumented and inconsistent (both in the methods of one container and also in the same methods of different container).
Just as it is sometimes important to be able to do strict inserts, it is important sometimes that we have maps that are lazy in the elements. There are important use cases both ways.
I thought we are talking only about keys/elements. I would leave the values untouched.
There are problems at several levels: 1 it needs to be documented/specified to what extent one can rely on strictness; usually, Maps are - strict in keys and structure (so forcing the Map forces all keys), - not strict in values 2 it needs to be documented to what extent operations preserve or enforce strictness; 3 when the default strictness does not match the needs of the application, the library needs to provide sufficient hooks for enforcing different strictness policies (there might be a relation to the recent strategies work, btw); All three points are currently undocumented, though 1 is assumed. 2 and 3 are terribly inconsistent at the moment: not only are there missing strict-in-values versions of most operations, and some versions only in Map, not in IntMap; the real problems start when the existing operations simply do not provide the means to enforce different strictness policies. Things one wants to be able to do include: - pass strict operations to higher-order API functions and have the strictness information propagated through the strictness analyser - tie value evaluation to key availability (since the Map is strict in the latter, it would then become strict in the former, too) - preserve/push strictness policies through all API operations (don't have foldl's accumulating Maps blow up because of non-strictifiable Map combination operators; don't have inserts sometimes evaluate values, sometime not; don't have Map traversal operations in the API that give no handle for adding strictness) As Duncan points out, both strict and non-strict versions have their uses. Non-strict in values is a useful default, IMHO, but one needs to be able to introduce strictness where required (some have argued for strict in values as default, with a data type indirection to re-introduce non-strictness). The blowups I've seen usually involve values not being evaluated through Map/IntMap operations (insert, fold, union, ..). But as with foldr/foldl/fold', that does not mean that everything should be strict, just that users need to be able to specify as much or as little strictness as is appropriate for their application. Claus

fox:
Hi,
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I think we need the patches as soon as possible, to help out the Twitter guys, who are doing some processing with IntMap at the moment. Can you make the current patches public this weekend, perhaps? -- Don

Hi,
fox:
Hi,
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I think we need the patches as soon as possible, to help out the Twitter guys, who are doing some processing with IntMap at the moment.
Can you make the current patches public this weekend, perhaps?
I am really sorry, I did not get to email sooner. I will also not be reading mail from Friday to Sunday. I will publish my current patches tomorrow, but they are by no means complete and I want to improve them further. Cheers, Milan

Milan Straka wrote:
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I am a bit surprised about the types chosen for HashSet and HashMap (section 5, p.9). Usually, hash-based containers are optimized for the case that the hash-buckets remain small. For that case, it's hard to imagine that the given types wouldn't be slower than the more straightforward types: data HashSet elem = HS (IntMap [elem]) data HashMap key val = HM (IntMap [(key, val)]) Others have suggested further improvements by using strict (unboxed?) tuples, a variant on lists that is strict and/or non-empty. Thanks, Yitz

Hi,
Milan Straka wrote:
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I am a bit surprised about the types chosen for HashSet and HashMap (section 5, p.9). Usually, hash-based containers are optimized for the case that the hash-buckets remain small. For that case, it's hard to imagine that the given types wouldn't be slower than the more straightforward types:
data HashSet elem = HS (IntMap [elem]) data HashMap key val = HM (IntMap [(key, val)])
Others have suggested further improvements by using strict (unboxed?) tuples, a variant on lists that is strict and/or non-empty.
I agree that my choice was not good enough. Johan Tibell also commented on this. After suggestions by others I am thinking about data Some elem = Single {-# UNPACK #-} !elem | More (Set elem) or data Some elem = Single {-# UNPACK #-} !elem | More {-# UNPACK #-} !elem (Some elem) I will improve the final version of the paper and also the implementation. Cheers, Milan

On Thu, Jul 1, 2010 at 12:03 AM, Milan Straka
Hi,
Milan Straka wrote:
I am working on improving containers performance and have written a draft paper about it http://fox.ucw.cz/papers/containers/containers.pdf
I am a bit surprised about the types chosen for HashSet and HashMap (section 5, p.9). Usually, hash-based containers are optimized for the case that the hash-buckets remain small. For that case, it's hard to imagine that the given types wouldn't be slower than the more straightforward types:
data HashSet elem = HS (IntMap [elem]) data HashMap key val = HM (IntMap [(key, val)])
Others have suggested further improvements by using strict (unboxed?) tuples, a variant on lists that is strict and/or non-empty.
I agree that my choice was not good enough. Johan Tibell also commented on this.
After suggestions by others I am thinking about data Some elem = Single {-# UNPACK #-} !elem | More (Set elem) or data Some elem = Single {-# UNPACK #-} !elem | More {-# UNPACK #-} !elem (Some elem)
Unfortunately unpacking doesn't work for polymorphic fields (the new warning for ineffective unpack pragmas in GHC head should warn about this). Given this you might as well use a standard list. Johan

On Wed, Jun 30, 2010 at 8:02 PM, Johan Tibell
On Thu, Jul 1, 2010 at 12:03 AM, Milan Straka
wrote: After suggestions by others I am thinking about data Some elem = Single {-# UNPACK #-} !elem | More (Set elem) or data Some elem = Single {-# UNPACK #-} !elem | More {-# UNPACK #-} !elem (Some elem)
Unfortunately unpacking doesn't work for polymorphic fields (the new warning for ineffective unpack pragmas in GHC head should warn about this). Given this you might as well use a standard list.
However strictness information does work. But I don't know the answer for the following questions: - Should the elements be strict even while they are not unpacked? Performance gains? - Should the spine of the list be strict? Performance gains? Space leak gains? My guess, without any benchmarks, is that both elements and the spine should be strict. In any case, for the HashMap the following data type saves the indirection of the tuples: data SomePairs key val = Pair !key !val | MorePairs !key !val !(SomePairs key val) Cheers, -- Felipe.

On Thu, Jul 1, 2010 at 3:38 AM, Felipe Lessa
On Wed, Jun 30, 2010 at 8:02 PM, Johan Tibell
wrote: On Thu, Jul 1, 2010 at 12:03 AM, Milan Straka
wrote: After suggestions by others I am thinking about data Some elem = Single {-# UNPACK #-} !elem | More (Set elem) or data Some elem = Single {-# UNPACK #-} !elem | More {-# UNPACK #-} !elem (Some elem)
Unfortunately unpacking doesn't work for polymorphic fields (the new warning for ineffective unpack pragmas in GHC head should warn about this). Given this you might as well use a standard list.
However strictness information does work. But I don't know the answer for the following questions:
- Should the elements be strict even while they are not unpacked? Performance gains?
- Should the spine of the list be strict? Performance gains? Space leak gains?
In my experience/benchmarks strict spines are faster, probably due to better cache usage as the whole structure is updated at once. All the container data structures use strict spines. Johan

On Thu, Jul 1, 2010 at 11:54 AM, Johan Tibell
On Thu, Jul 1, 2010 at 3:38 AM, Felipe Lessa
wrote: On Wed, Jun 30, 2010 at 8:02 PM, Johan Tibell
wrote: On Thu, Jul 1, 2010 at 12:03 AM, Milan Straka
wrote: After suggestions by others I am thinking about data Some elem = Single {-# UNPACK #-} !elem | More (Set elem) or data Some elem = Single {-# UNPACK #-} !elem | More {-# UNPACK #-} !elem (Some elem)
Unfortunately unpacking doesn't work for polymorphic fields (the new warning for ineffective unpack pragmas in GHC head should warn about this). Given this you might as well use a standard list.
However strictness information does work. But I don't know the answer for the following questions:
- Should the elements be strict even while they are not unpacked? Performance gains?
For keys that are used by the structure, this can mean that during lookups, etc. the compiler can know that it can skip evaluation. But since you hand off to a user provided comparison function (even for the Eq/Ord typeclasses), the information that it can skip evaluation is often lost. - Should the spine of the list be strict? Performance gains? Space leak
gains?
In my experience/benchmarks strict spines are faster, probably due to better cache usage as the whole structure is updated at once.
That and the compiler can emit code that just checks tags and doesn't have to deal with forcing evaluation at all on recursion. All the container data structures use strict spines.
Almost all. =) Strict spines are a must, except in those few cases where you need to not be strict to retain appropriate asymptotics in a persistent environment (the deep node pointers in Data.Sequence.Seq for instance must be lazy or operations can be logarithmically more expensive when used persistently). In practice this applies to none of the other types in containers though. -Edward Kmett
participants (8)
-
Claus Reinke
-
Don Stewart
-
Duncan Coutts
-
Edward Kmett
-
Felipe Lessa
-
Johan Tibell
-
Milan Straka
-
Yitzchak Gale