Proposal: Partitionable goes somewhere + containers instances

<subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the
Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that
it needs to be able to "bottom out". Mike, our partition function's list
return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where*
* partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and
you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some
kind of Partitionable instances! * Johan & others, I would be happy to
provide a patch if the class can be agreed on. This is important because
currently the balanced tree structure of Data.Set/Map is an *amazing and
beneficial property* that is *not* exposed at all through the API.
For example, it would be great to have a parallel traverse_ for Maps and
Sets in the Par monad. The particular impetus is that our new and enhanced
Par monad makes extensive use of Maps and Sets, both the pure, balanced
ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module
that exposed the Bin/Tip, but I assume people would rather have a clean
Partitionable instance...
Best,
-Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I'd think
partition :: t -> Either t (t, t)
might be more suited then...
Nicolas
On Sep 29, 2013 1:21 AM, "Ryan Newton"
<subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi Ryan,
-----Original message----- From: Ryan Newton
Sent: 29 Sep 2013, 04:21 <snip>
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
<snip>
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API.
Some comments: 1) containers are a boot package (http://ghc.haskell.org/trac/ghc/wiki/Commentary/Libraries) therefore its dependencies have to be boot packages too. If Partitionable gets into base or some other boot package, fine :) 2) IntMap/IntSet have different partitioning operation than Map/Set: partition :: IntMap a -> Either IntMap (IntMap a, IntMap) partition :: Map k v -> Either Map (Map, k, v, Map) Nevertheless, IntMap/IntSet are not well balanced, so maybe it would be fine to have partition working for Map/Set. 3) Partition somehow exposes internal structure, which forces us to only some implementations. Nevertheless, I doubt the representation of containers will change wildly (although I am planning to add data constructor to Map/Set shortly, so you never know). It seems that the best course of action would be to ignore the Partitionable class and instead provide methods in the containers to allow splitting. The question is how should the API look like. Currently IntMap and IntSet are deliberately as close to Map and Set as possible. Introducing this splitting operation would enlarge the difference between them. But as noted, we could provide split only for Map and Set, as IntMap/IntSet are not well balanced anyway. Cheers, Milan

I don't know that it belongs in the "standard" libraries, but there could
definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the
signature written there prevents it from being used on ByteString, Text,
etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost,
as with the class associated constraint on the argument, you'd already
given up polymorphic recursion.
There still remain issues. partition is already established as the
filterthat returns both the matching and unmatching elements, so the
name is
wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth
seeing how many others can be generalized similarly and talk to Brent about
adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
<subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Thanks Edward. Good point about Brent's 'split' package. That would be a
really nice place to put the class. But it doesn't currently depend on
containers or vector so I suppose the other instances would need to go
somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the
containers package.
I think the complicated bit will be describing how "best-efforty" splitting
variants are:
- Is it guaranteed O(1) time and allocation?
- Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition
imbalance and the work required to achieve balance. But with some data
structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long
as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filterthat returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Besides just partition balance, the ordering of the resulting partitions is
important. For example, the most efficient way to partition a list is by
taking an every-other-n approach, whereas the most efficient way to
partition a vector is by using a slice. (This, BTW, might be a good
alternative name for the class to avoid the conflict Edward mentioned.)
These partitions are not necessarily usable in the same contexts. For
example, Vector's slicing strategy is always usable (only requires
associativity of the monoid to reduce over, which is guaranteed by the
laws). But the List's strategy also requires commutativity. This is not
guaranteed.
I would guess that maintaining the partition ordering and balance will be
at odds with each other in the Map and Set cases.
On Sun, Sep 29, 2013 at 8:06 PM, Ryan Newton
Thanks Edward. Good point about Brent's 'split' package. That would be a really nice place to put the class. But it doesn't currently depend on containers or vector so I suppose the other instances would need to go somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the containers package.
I think the complicated bit will be describing how "best-efforty" splitting variants are:
- Is it guaranteed O(1) time and allocation? - Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition imbalance and the work required to achieve balance. But with some data structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
wrote: I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filterthat returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Upon consideration from a package management perspective this is probably
easiest done by building a new small package to provide the functionality
you want. That way we don't haphazardly change the transitive dependencies
of a big chunk of the ecosystem and it can rest atop the various containers
libraries. This also gives you a lot of opportunity to iterate on the API
in public without incurring the instant rigidity of the Haskell Platform.
On Sun, Sep 29, 2013 at 11:06 PM, Ryan Newton
Thanks Edward. Good point about Brent's 'split' package. That would be a really nice place to put the class. But it doesn't currently depend on containers or vector so I suppose the other instances would need to go somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the containers package.
I think the complicated bit will be describing how "best-efforty" splitting variants are:
- Is it guaranteed O(1) time and allocation? - Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition imbalance and the work required to achieve balance. But with some data structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
wrote: I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filterthat returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: Hi all,
We all know and love Data.Foldable and are familiar with left folds and right folds. But what you want in a parallel program is a balanced fold over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE balanced trees. Hmm, but how do we expose that?
Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
fold t = go t where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Edward,
The problem is that I need *something* more from the containers library to
be able to construct this as a separate library. I don't think I can use
foldMap to implement a Splittable/Partitionable instance for Data.Set,
namely because I specifically want to do O(1) work instead of any kind of
full traversal of the structure.
Is the least possible disruption here to just have a Data.Map.Internal that
exposes Tip and Bin? It can be marked with suitable warnings at the top of
the module.
Or would the preference to be to expose something more abstract of type
"Map k a -> [Map k a]" that chops it into the "natural pieces"? [1]
-Ryan
[1] Btw, it seems like returning a tuple here might make deforestation more
likely than returning a list... right?
On Mon, Sep 30, 2013 at 9:52 AM, Edward Kmett
Upon consideration from a package management perspective this is probably easiest done by building a new small package to provide the functionality you want. That way we don't haphazardly change the transitive dependencies of a big chunk of the ecosystem and it can rest atop the various containers libraries. This also gives you a lot of opportunity to iterate on the API in public without incurring the instant rigidity of the Haskell Platform.
On Sun, Sep 29, 2013 at 11:06 PM, Ryan Newton
wrote: Thanks Edward. Good point about Brent's 'split' package. That would be a really nice place to put the class. But it doesn't currently depend on containers or vector so I suppose the other instances would need to go somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the containers package.
I think the complicated bit will be describing how "best-efforty" splitting variants are:
- Is it guaranteed O(1) time and allocation? - Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition imbalance and the work required to achieve balance. But with some data structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
wrote: I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filter that returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API.
For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
wrote: > Hi all, > > We all know and love Data.Foldable and are familiar with left folds and > right folds. But what you want in a parallel program is a balanced fold > over a tree. Fortunately, many of our datatypes (Sets, Maps) actually ARE > balanced trees. Hmm, but how do we expose that? Hi Ryan,
At least for Data.Map, the Foldable instance seems to have a reasonably balanced fold called fold (or foldMap):
> fold t = go t > where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r)
This doesn't seem to be guaranteed though. For example ghc's derived instance writes the foldr only, so fold would be right-associated for a:
> data T a = B (T a) (T a) | L a deriving (Foldable)
Regards, Adam _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Ok, so we've narrowed the focus quite a bit to JUST exposing enough from
containers to enable a third-party library to do all the parallel
traversals it wants. Which of the following limited proposal would people
like more?
(1) Expose Bin/Tip from, say, Data.Map.Internal, as in this patch:
https://github.com/rrnewton/containers/commit/5d6b07f69e8396023101039a4aaab6...
(2) a splitTree function [1]. A patch can be found here:
https://github.com/rrnewton/containers/commit/6153896f0c7e6cdf70656dc6b641ce...
The argument for (1) would be that it doesn't pollute any namespaces people
actually use at all, and Tip & Bin would seem to be pretty darn stable at
this point. The only consumers of this information in practice would be
downstream companion libraries (like, say, a parallel traversals library
for monad-par & LVish!) Those could be updated if there were ever a
seismic shift in the containers implementations.
Finally, I'd like to propose October 14th as a date to close discussion.
Best,
-Ryan
[1] Here's the relevant definition:
*-- | /O(1)/. Decompose a Map into pieces. No guarantee is made as to the
sizes of*
*-- the pieces, but some of them will be balanced, and some may be empty.*
*splitTree :: Map k b -> Maybe (Map k b, Map k b, Map k b)*
*splitTree orig =*
* case orig of *
* Tip -> Nothing*
* Bin 1 k v l r -> Just (singleton k v, l, r)*
*{-# INLINE splitTree #-}*
You could argue for returning a list instead, but I'm betting they'll be
harder to deforest. (I'll have to test it.) And the above should be
emulatable by any future implementation...
On Mon, Sep 30, 2013 at 11:28 AM, Ryan Newton
Edward,
The problem is that I need *something* more from the containers library to be able to construct this as a separate library. I don't think I can use foldMap to implement a Splittable/Partitionable instance for Data.Set, namely because I specifically want to do O(1) work instead of any kind of full traversal of the structure.
Is the least possible disruption here to just have a Data.Map.Internal that exposes Tip and Bin? It can be marked with suitable warnings at the top of the module.
Or would the preference to be to expose something more abstract of type "Map k a -> [Map k a]" that chops it into the "natural pieces"? [1]
-Ryan
[1] Btw, it seems like returning a tuple here might make deforestation more likely than returning a list... right?
On Mon, Sep 30, 2013 at 9:52 AM, Edward Kmett
wrote: Upon consideration from a package management perspective this is probably easiest done by building a new small package to provide the functionality you want. That way we don't haphazardly change the transitive dependencies of a big chunk of the ecosystem and it can rest atop the various containers libraries. This also gives you a lot of opportunity to iterate on the API in public without incurring the instant rigidity of the Haskell Platform.
On Sun, Sep 29, 2013 at 11:06 PM, Ryan Newton
wrote: Thanks Edward. Good point about Brent's 'split' package. That would be a really nice place to put the class. But it doesn't currently depend on containers or vector so I suppose the other instances would need to go somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the containers package.
I think the complicated bit will be describing how "best-efforty" splitting variants are:
- Is it guaranteed O(1) time and allocation? - Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition imbalance and the work required to achieve balance. But with some data structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
wrote: I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filter that returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: I've got a Partitionable class that I've been using for this purpose:
https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const...
The function called "parallel" in the HLearn library will automatically parallelize any homomorphism from a Partionable to a Monoid. I specifically use that to parallelize machine learning algorithms.
I have two thoughts for better abstractions:
1) This Partitionable class is essentially a comonoid. By reversing the arrows of mappend, we get:
comappend :: a -> (a,a)
By itself, this works well if the number of processors you have is a power of two, but it needs some more fanciness to get things balanced properly for other numbers of processors. I bet there's another algebraic structure that would capture these other cases, but I'm not sure what it is.
2) I'm working with parallelizing tree structures right now (kd-trees, cover trees, oct-trees, etc.). The real problem is not splitting the number of data points equally (this is easy), but splitting the amount of work equally. Some points take longer to process than others, and this cannot be determined in advance. Therefore, an equal split of the data points can result in one processor getting 25% of the work load, and the second processor getting 75%. Some sort of lazy Partitionable class that was aware of processor loads and didn't split data points until they were needed would be ideal for this scenario.
On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: > On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton
> wrote: > > Hi all, > > > > We all know and love Data.Foldable and are familiar with left > folds and > > right folds. But what you want in a parallel program is a > balanced fold > > over a tree. Fortunately, many of our datatypes (Sets, Maps) > actually ARE > > balanced trees. Hmm, but how do we expose that? > > Hi Ryan, > > At least for Data.Map, the Foldable instance seems to have a > reasonably balanced fold called fold (or foldMap): > > > fold t = go t > > where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r) > > This doesn't seem to be guaranteed though. For example ghc's derived > instance writes the foldr only, so fold would be right-associated for > a: > > > data T a = B (T a) (T a) | L a deriving (Foldable) > > Regards, > Adam > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe >

I'm in favor of Option 1 (exposing Bin/Tip) on general principles.
On Sun, Oct 6, 2013 at 11:16 PM, Ryan Newton
Ok, so we've narrowed the focus quite a bit to JUST exposing enough from containers to enable a third-party library to do all the parallel traversals it wants. Which of the following limited proposal would people like more?
(1) Expose Bin/Tip from, say, Data.Map.Internal, as in this patch:
https://github.com/rrnewton/containers/commit/5d6b07f69e8396023101039a4aaab6...
(2) a splitTree function [1]. A patch can be found here: https://github.com/rrnewton/containers/commit/6153896f0c7e6cdf70656dc6b641ce...
The argument for (1) would be that it doesn't pollute any namespaces people actually use at all, and Tip & Bin would seem to be pretty darn stable at this point. The only consumers of this information in practice would be downstream companion libraries (like, say, a parallel traversals library for monad-par & LVish!) Those could be updated if there were ever a seismic shift in the containers implementations.
Finally, I'd like to propose October 14th as a date to close discussion.
Best, -Ryan
[1] Here's the relevant definition:
*-- | /O(1)/. Decompose a Map into pieces. No guarantee is made as to the sizes of* *-- the pieces, but some of them will be balanced, and some may be empty.* *splitTree :: Map k b -> Maybe (Map k b, Map k b, Map k b)* *splitTree orig =* * case orig of * * Tip -> Nothing* * Bin 1 k v l r -> Just (singleton k v, l, r)* *{-# INLINE splitTree #-}*
You could argue for returning a list instead, but I'm betting they'll be harder to deforest. (I'll have to test it.) And the above should be emulatable by any future implementation...
On Mon, Sep 30, 2013 at 11:28 AM, Ryan Newton
wrote: Edward,
The problem is that I need *something* more from the containers library to be able to construct this as a separate library. I don't think I can use foldMap to implement a Splittable/Partitionable instance for Data.Set, namely because I specifically want to do O(1) work instead of any kind of full traversal of the structure.
Is the least possible disruption here to just have a Data.Map.Internal that exposes Tip and Bin? It can be marked with suitable warnings at the top of the module.
Or would the preference to be to expose something more abstract of type "Map k a -> [Map k a]" that chops it into the "natural pieces"? [1]
-Ryan
[1] Btw, it seems like returning a tuple here might make deforestation more likely than returning a list... right?
On Mon, Sep 30, 2013 at 9:52 AM, Edward Kmett
wrote: Upon consideration from a package management perspective this is probably easiest done by building a new small package to provide the functionality you want. That way we don't haphazardly change the transitive dependencies of a big chunk of the ecosystem and it can rest atop the various containers libraries. This also gives you a lot of opportunity to iterate on the API in public without incurring the instant rigidity of the Haskell Platform.
On Sun, Sep 29, 2013 at 11:06 PM, Ryan Newton
wrote: Thanks Edward. Good point about Brent's 'split' package. That would be a really nice place to put the class. But it doesn't currently depend on containers or vector so I suppose the other instances would need to go somewhere else. (Assuming containers only exported monomorphic versions.)
Maybe a next step would be proposing some monomorphic variants for the containers package.
I think the complicated bit will be describing how "best-efforty" splitting variants are:
- Is it guaranteed O(1) time and allocation? - Is the provided Int an upper bound? Lower(ish) bound? Or just a hint?
With some data structures, there will be a trade-off between partition imbalance and the work required to achieve balance. But with some data structures it is happily not a problem (e.g. Vector)!
But whether there's one variant or a few, I'd be happy either way, as long as I get at least the cheap one (i.e. prefer imbalance to restructuring).
-Ryan
On Sun, Sep 29, 2013 at 8:20 AM, Edward Kmett
wrote: I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though!
class Partitionable t where
partition :: Int -> t -> [t]
Now ByteString, Text etc. can be instances and no real flexibility is lost, as with the class associated constraint on the argument, you'd already given up polymorphic recursion.
There still remain issues. partition is already established as the filter that returns both the matching and unmatching elements, so the name is wrong.
This is a generalization of Data.List.splitEvery, perhaps it is worth seeing how many others can be generalized similarly and talk to Brent about adding, say, a Data.Split module to his split package in the platform?
-Edward
On Sun, Sep 29, 2013 at 4:21 AM, Ryan Newton
wrote: <subject change>
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: > I've got a Partitionable class that I've been using for this purpose: > > https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const... >
Mike -- Neat, that's a cool library!
Edward -- ideally, where in the standard libraries should the Partitionable comonoid go?
Btw, I'm not sure what the ideal return type for comappend is, given that it needs to be able to "bottom out". Mike, our partition function's list return type seems more reasonable. Or maybe something simple would be this:
*class Partitionable t where* * partition :: t -> Maybe (t,t)*
That is, at some point its not worth splitting and returns Nothing, and you'd better be able to deal with the 't' directly.
So what I really want is for the *containers package to please get some kind of Partitionable instances! * Johan & others, I would be happy to provide a patch if the class can be agreed on. This is important because currently the balanced tree structure of Data.Set/Map is an *amazing and beneficial property* that is *not* exposed at all through the API. For example, it would be great to have a parallel traverse_ for Maps and Sets in the Par monad. The particular impetus is that our new and enhanced Par monad makes extensive use of Maps and Sets, both the pure, balanced ones, and lockfree/inplace ones based on concurrent skip lists:
http://www.cs.indiana.edu/~rrnewton/haddock/lvish/
Alternatively, it would be ok if there were a "Data.Map.Internal" module that exposed the Bin/Tip, but I assume people would rather have a clean Partitionable instance...
Best, -Ryan
On Sun, Sep 29, 2013 at 3:31 AM, Mike Izbicki
wrote: > I've got a Partitionable class that I've been using for this purpose: > > > https://github.com/mikeizbicki/ConstraintKinds/blob/master/src/Control/Const... > > The function called "parallel" in the HLearn library will > automatically parallelize any homomorphism from a Partionable to a Monoid. > I specifically use that to parallelize machine learning algorithms. > > I have two thoughts for better abstractions: > > 1) This Partitionable class is essentially a comonoid. By > reversing the arrows of mappend, we get: > > comappend :: a -> (a,a) > > By itself, this works well if the number of processors you have is a > power of two, but it needs some more fanciness to get things balanced > properly for other numbers of processors. I bet there's another algebraic > structure that would capture these other cases, but I'm not sure what it is. > > 2) I'm working with parallelizing tree structures right now > (kd-trees, cover trees, oct-trees, etc.). The real problem is not > splitting the number of data points equally (this is easy), but splitting > the amount of work equally. Some points take longer to process than > others, and this cannot be determined in advance. Therefore, an equal > split of the data points can result in one processor getting 25% of the > work load, and the second processor getting 75%. Some sort of lazy > Partitionable class that was aware of processor loads and didn't split data > points until they were needed would be ideal for this scenario. > > On Sat, Sep 28, 2013 at 6:46 PM, adam vogt
wrote: > >> On Sat, Sep 28, 2013 at 1:09 PM, Ryan Newton >> wrote: >> > Hi all, >> > >> > We all know and love Data.Foldable and are familiar with left >> folds and >> > right folds. But what you want in a parallel program is a >> balanced fold >> > over a tree. Fortunately, many of our datatypes (Sets, Maps) >> actually ARE >> > balanced trees. Hmm, but how do we expose that? >> >> Hi Ryan, >> >> At least for Data.Map, the Foldable instance seems to have a >> reasonably balanced fold called fold (or foldMap): >> >> > fold t = go t >> > where go (Bin _ _ v l r) = go l `mappend` (v `mappend` go r) >> >> This doesn't seem to be guaranteed though. For example ghc's derived >> instance writes the foldr only, so fold would be right-associated >> for >> a: >> >> > data T a = B (T a) (T a) | L a deriving (Foldable) >> >> Regards, >> Adam >> _______________________________________________ >> Haskell-Cafe mailing list >> Haskell-Cafe@haskell.org >> http://www.haskell.org/mailman/listinfo/haskell-cafe >> > > _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Hi all,
-----Original message----- From: Ryan Newton
Sent: 7 Oct 2013, 00:16 Ok, so we've narrowed the focus quite a bit to JUST exposing enough from containers to enable a third-party library to do all the parallel traversals it wants. Which of the following limited proposal would people like more?
(1) Expose Bin/Tip from, say, Data.Map.Internal, as in this patch:
https://github.com/rrnewton/containers/commit/5d6b07f69e8396023101039a4aaab6...
(2) a splitTree function [1]. A patch can be found here: https://github.com/rrnewton/containers/commit/6153896f0c7e6cdf70656dc6b641ce...
The argument for (1) would be that it doesn't pollute any namespaces people actually use at all, and Tip & Bin would seem to be pretty darn stable at this point. The only consumers of this information in practice would be downstream companion libraries (like, say, a parallel traversals library for monad-par & LVish!) Those could be updated if there were ever a seismic shift in the containers implementations.
I am strongly against (1). Exposing internal representation seem really wrong. FYI, I am planning to change the representation of Data.Map and Data.Set to a three constructor representation (I have already some benchmarks, halving time complexity of fold and decreasing memory usage by ~ 20%). So no, the internal representation is subject to change and I do not want it to become part of API. As for (2), I am not very happy about the type -- returning _three_ maps makes some assumptions about the internal representation. This can be seen when considering IntMap.splitTree -- there are no three IntMaps to return in a IntMap.splitTree, only two. What about the list version splitTree :: Map k a -> [Map k a] If splitTree is INLINE, I think we can assume the deforestation will happen. That would allow us to define splitTree for IntMap too. If someone is worried, could they check that deforestation does really happen? Cheers, Milan

A rule of thumb that has served me well w.r.t exposing internal modules is
to expose a Data.Foo.Internal but make it clear it is a very fragile
interface.
Even by going to far as to say this module does not follow the PVP and that
they should expect breaking changes to come fast and often. Users should
only safely depend on it with minor-version specific bounds then. This
ameliorates the concerns about how it ties your hands as an implementor.
Breaking this API shouldn't require discussion on the mailing lists, as it
is an internal implementation detail. This should further ameliorate
concerns about it tying your hands as an implementor.
This lets users who need to write performant code not have to fork the
entire package. (I've had to do this with Map before, Text and other
packages that have been rather hide-bound about not exposing implementation
details, it sucks.)
My experience is maybe 1-2% of your users need it, but when they need it it
is the difference between the package being usable or having to be
completely replaced with something else. They are also the kind of users
who understand the need for the best possible implementation and who will
roll with the punches.
Chasing after changes in the implementation is generally far less work than
maintaining an entire fork.
-Edward
On Mon, Oct 7, 2013 at 4:10 AM, Milan Straka
Hi all,
-----Original message----- From: Ryan Newton
Sent: 7 Oct 2013, 00:16 Ok, so we've narrowed the focus quite a bit to JUST exposing enough from containers to enable a third-party library to do all the parallel traversals it wants. Which of the following limited proposal would people like more?
(1) Expose Bin/Tip from, say, Data.Map.Internal, as in this patch:
https://github.com/rrnewton/containers/commit/5d6b07f69e8396023101039a4aaab6...
(2) a splitTree function [1]. A patch can be found here:
https://github.com/rrnewton/containers/commit/6153896f0c7e6cdf70656dc6b641ce...
The argument for (1) would be that it doesn't pollute any namespaces
people
actually use at all, and Tip & Bin would seem to be pretty darn stable at this point. The only consumers of this information in practice would be downstream companion libraries (like, say, a parallel traversals library for monad-par & LVish!) Those could be updated if there were ever a seismic shift in the containers implementations.
I am strongly against (1). Exposing internal representation seem really wrong. FYI, I am planning to change the representation of Data.Map and Data.Set to a three constructor representation (I have already some benchmarks, halving time complexity of fold and decreasing memory usage by ~ 20%). So no, the internal representation is subject to change and I do not want it to become part of API.
As for (2), I am not very happy about the type -- returning _three_ maps makes some assumptions about the internal representation. This can be seen when considering IntMap.splitTree -- there are no three IntMaps to return in a IntMap.splitTree, only two.
What about the list version splitTree :: Map k a -> [Map k a] If splitTree is INLINE, I think we can assume the deforestation will happen. That would allow us to define splitTree for IntMap too. If someone is worried, could they check that deforestation does really happen?
Cheers, Milan

On Mon, 7 Oct 2013, Edward Kmett wrote:
A rule of thumb that has served me well w.r.t exposing internal modules is to expose a Data.Foo.Internal but make it clear it is a very fragile interface. Even by going to far as to say this module does not follow the PVP and that they should expect breaking changes to come fast and often. Users should only safely depend on it with minor-version specific bounds then. This ameliorates the concerns about how it ties your hands as an implementor.
A PVP compliant solution would be to divide 'containers' into 'containers-internal' and 'containers'. The new package 'containers-internal' exposes internal data structures, but increases version numbers at a higher rate than 'containers'. The package 'containers' would only expose the public API and thus a more stable interface.

While it is PVP compliant, it would also tie the hands of the developers enough that I don't see many developers wanting to adopt it. Working across multiple packages means very painful changes to your build process if you don't use cabal install for everything. It is even rather painful if you do use cabal install for everything. A slightly more palatable PVP compliant version which burns through version numbers much faster is to separate the two digits of the major version and use one for .Internal changes and one for changes to the majority of the public API. Then users who only use the public API can depend on the first digit. However, this has awkward aesthetics as your version numbers start skyrocketing. I'm not a fan of either approach. -Edward On Mon, Oct 7, 2013 at 10:35 AM, Henning Thielemann < lemming@henning-thielemann.de> wrote:
On Mon, 7 Oct 2013, Edward Kmett wrote:
A rule of thumb that has served me well w.r.t exposing internal modules
is to expose a Data.Foo.Internal but make it clear it is a very fragile interface. Even by going to far as to say this module does not follow the PVP and that they should expect breaking changes to come fast and often. Users should only safely depend on it with minor-version specific bounds then. This ameliorates the concerns about how it ties your hands as an implementor.
A PVP compliant solution would be to divide 'containers' into 'containers-internal' and 'containers'. The new package 'containers-internal' exposes internal data structures, but increases version numbers at a higher rate than 'containers'. The package 'containers' would only expose the public API and thus a more stable interface.

Hi all,
Thanks for more feedback.
I tend to not be afraid of the ".Internal" module that exposes stuff that
changes, with appropriate caveats. But it also seems like relatively
little effort to have a cleaner public "splitTree" interface in this case,
hopefully without extra overhead.
I propose to test simple use cases of a tuple and list version of splitTree
and check out the deforestation properties.
Milan, I know the three-tuple version looks weird. But since it's OK for
some of those three chunks to be empty I think most future implementations
would have an easy way to maintain compatibility, wouldn't they?
UNLESS, that is,y our new changes involve trees that would expose
*more*than three-way-splits at a time (4-way, 5-way, etc). Do they?
Cheers,
-Ryan
On Mon, Oct 7, 2013 at 12:51 PM, Edward Kmett
While it is PVP compliant, it would also tie the hands of the developers enough that I don't see many developers wanting to adopt it. Working across multiple packages means very painful changes to your build process if you don't use cabal install for everything. It is even rather painful if you do use cabal install for everything.
A slightly more palatable PVP compliant version which burns through version numbers much faster is to separate the two digits of the major version and use one for .Internal changes and one for changes to the majority of the public API. Then users who only use the public API can depend on the first digit. However, this has awkward aesthetics as your version numbers start skyrocketing.
I'm not a fan of either approach.
-Edward
On Mon, Oct 7, 2013 at 10:35 AM, Henning Thielemann < lemming@henning-thielemann.de> wrote:
On Mon, 7 Oct 2013, Edward Kmett wrote:
A rule of thumb that has served me well w.r.t exposing internal modules
is to expose a Data.Foo.Internal but make it clear it is a very fragile interface. Even by going to far as to say this module does not follow the PVP and that they should expect breaking changes to come fast and often. Users should only safely depend on it with minor-version specific bounds then. This ameliorates the concerns about how it ties your hands as an implementor.
A PVP compliant solution would be to divide 'containers' into 'containers-internal' and 'containers'. The new package 'containers-internal' exposes internal data structures, but increases version numbers at a higher rate than 'containers'. The package 'containers' would only expose the public API and thus a more stable interface.

Hi Ryan,
-----Original message----- From: Ryan Newton
Sent: 7 Oct 2013, 13:41 Hi all,
Thanks for more feedback.
I tend to not be afraid of the ".Internal" module that exposes stuff that changes, with appropriate caveats. But it also seems like relatively little effort to have a cleaner public "splitTree" interface in this case, hopefully without extra overhead.
I propose to test simple use cases of a tuple and list version of splitTree and check out the deforestation properties.
Milan, I know the three-tuple version looks weird. But since it's OK for some of those three chunks to be empty I think most future implementations would have an easy way to maintain compatibility, wouldn't they?
Yes, they would. Nevertheless, it seems to me that because a Map can be splitted in one, two or three subtrees, it would make sense for splitTree to return either zero, one, two or three subtrees, instead of three, some of them empty.
UNLESS, that is,y our new changes involve trees that would expose *more*than three-way-splits at a time (4-way, 5-way, etc). Do they?
No, they do not. A limit of three subtrees is fine for containers (3 for Set/Map, 2 for IntSet/IntMap in all planned improvements). But consider for example unordered-containers. The HashMap is a multiway tree with branching factor 1-16. On that account, maybe we should use a different name than splitTree -- I do not like the Tree in the name, because tree is only a particular data structure (for example, IntSet is actually a dense compressed trie). What about something in the lines of 'divide'? Only to be sure -- I support the list version provided that its performance is not worse than the tuple version. I did some tests with lists only (not Map) and it seems it is fine. But a thorough test is still needed. Cheers, Milan

On 09/29/13 08:20, Edward Kmett wrote:
I don't know that it belongs in the "standard" libraries, but there could definitely be a package for something similar.
ConstraintKinds are a pretty hefty extension to throw at it, and the signature written there prevents it from being used on ByteString, Text, etc.
This can be implemented with much lighter weight types though! class Partitionable t where partition :: Int -> t -> [t]
I'm not sure why I don't already have this method in the FactorialMonoid class, but I'll happily add it if anybody wants it. Probably under the name splitEvery, since I already have splitAt. I'm not sure this is actually the best answer to Ryan's original plea, because I thought the idea was to let the original monoid "split itself" in an optimal way, which would preferably be an O(1) operation. Then again, this could be overly optimistic. For example, Map is defined as data Map k a = Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a) | Tip so the simple O(1) split would produce three submaps, the middle one having only one element. This operation would not be very parallelization-friendly. That is not particularly surprising, since parallelization was not the main concern when Data.Map (or containers) was originally written. The main goal, as it should have been, was optimizing the containers for sequential execution speed. A containers-like package optimized for easy and efficient parallelization would have to be written almost from scratch.

so the simple O(1) split would produce three submaps, the middle one having only one element. This operation would not be very parallelization-friendly.
Actually, I'm perfectly happy with that in this case! - A decent work-stealing system can tolerate a fairly large number of excessively small, trivial computations. It's having *only* those that's a big problem. (Which is what you often get if your parallel container ops spawn a task per element.) - Since Maps support O(1) size, the consumer of the split-up-map could choose to sequentially execute the singleton maps if desired. Personally, I'm most interested in set-like operations and don't need any order guarantees. But that's another dimension in which one could chop up the API... Maybe this does deserve its own module in the namespace, and maybe its own package, as Edward suggested. -Ryan
participants (8)
-
Edward Kmett
-
Henning Thielemann
-
John Lato
-
Mario Blažević
-
Mike Izbicki
-
Milan Straka
-
Nicolas Trangez
-
Ryan Newton