Haskell platform proposal: split package

Hello everyone,

This is a proposal for the split package [1] to be included in the next major release of the Haskell Platform.

Everyone is invited to review this proposal, following the standard procedure [2] for proposing and reviewing packages.

Review comments should be sent to the libraries mailing list by August 20 (arbitrarily chosen; there's plenty of time before the October 1 deadline [3]). The Haskell Platform wiki will be kept up-to-date with the results of the review process:

http://trac.haskell.org/haskell-platform/wiki/Proposals/split

[1] http://hackage.haskell.org/package/split
[2] http://trac.haskell.org/haskell-platform/wiki/AddingPackages
[3] http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable

Credits
=======

Proposal author and package maintainer: Brent Yorgey <byorgey at cis.upenn.edu>

Abstract
========

The Data.List.Split module contains a wide range of strategies for splitting lists with respect to some sort of delimiter, mostly implemented through a unified combinator interface. The goal is to be a flexible yet simple alternative to the standard 'split' function found in some other mainstream languages.

Documentation and tarball from the hackage page:

http://hackage.haskell.org/package/split

Development repo:

darcs get http://code.haskell.org/~byorgey/code/split

Rationale
=========

Splitting a list into chunks based on some sort of delimiter(s) is a common need, and is provided in the standard libraries of several mainstream languages (e.g. Python [4], Ruby [5], Java [6]). Haskell beginners routinely ask whether such a function exists in the standard libraries. For a long time, the answer was no. Adding such a function to Haskell's standard libraries has been proposed multiple times over the years, but consensus was never reached on the design of such a function. (See, e.g. [7, 8, 9].)
[4] http://docs.python.org/py3k/library/stdtypes.html?highlight=split#str.split
[5] http://www.ruby-doc.org/core-1.9.3/String.html#method-i-split
[6] http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.la...)
[7] http://www.haskell.org/pipermail/libraries/2006-July/005504.html
[8] http://www.haskell.org/pipermail/libraries/2006-October/006072.html
[9] http://www.haskell.org/pipermail/libraries/2008-January/008922.html

In December 2008 the split package was released, implementing not just a single split method, but a wide range of splitting strategies. Since then the split package has gained wide acceptance, with almost 95 reverse dependencies [10], putting it in the top 40 for number of reverse dependencies on Hackage.

[10] http://packdeps.haskellers.com/reverse/split

The package is quite stable. Since the 0.1.4 release in April 2011 only very minor updates have been made. It has a large suite of QuickCheck properties [11]; to my recollection no bugs have ever been reported.

[11] http://code.haskell.org/~byorgey/code/split/Properties.hs

API
===

For a detailed description of the package API and example usage, see the Haddock documentation:

http://hackage.haskell.org/packages/archive/split/0.1.4.3/doc/html/Data-List...

Design decisions
================

Most of the library is based around a (rather simple) combinator interface. Combinators are used to build up configuration records (recording options such as whether to keep delimiters, whether to keep blank segments, etc). A configuration record is finally handed off to a function which performs a generic maximally-information-preserving splitting algorithm and then does various postprocessing steps (based on the configuration) to selectively throw information away. It is probably not the fastest way to implement these methods, but speed is explicitly not a design goal: the aim is to provide a reasonably wide range of splitting strategies which can be used simply.
Blazing speed (or more complex processing), when needed, can be obtained from a proper parsing package.

Open issues
===========

Use of GHC.Exts
---------------

At the request of a user, the 0.1.4.3 release switched from defining its own version of the standard 'build' function to importing it from GHC.Exts. This allows GHC to do more optimization, resulting in reported speedups to uses of splitEvery, splitPlaces, and splitPlacesBlanks. However, this makes the library GHC-specific. If any reviewers think this is an issue I would be willing to go back to defining build by hand, or to use CPP macros to select between build implementations based on the compiler.

Missing strategies
------------------

The specific way that the generic splitting algorithm is implemented does preclude some imaginable splitting strategies. For example, a few years ago I tried adding a strategy that used a predicate on pairs of elements, splitting down the middle of any pairs that satisfy the predicate, but gave up because it simply did not fit.
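To illustrate the design described above, here is a minimal self-contained sketch of the configuration-record-plus-postprocessing pattern. All names, and the simplification to a single-element delimiter predicate, are invented for this example; this is not the package's actual internals.

```haskell
-- Hypothetical, simplified sketch of the split package's design:
-- combinators tweak a configuration record, which a single driver
-- function interprets after a maximally-information-preserving split.
data Splitter a = Splitter
  { isDelim    :: a -> Bool  -- single-element delimiter predicate
  , keepDelims :: Bool       -- keep delimiters as their own chunks?
  , keepBlanks :: Bool       -- keep empty chunks?
  }

defaultSplitter :: (a -> Bool) -> Splitter a
defaultSplitter p = Splitter p False True

-- Combinators only adjust the configuration record.
dropBlanks :: Splitter a -> Splitter a
dropBlanks s = s { keepBlanks = False }

keepDelimiters :: Splitter a -> Splitter a
keepDelimiters s = s { keepDelims = True }

-- The driver: split keeping maximal information, then postprocess
-- to selectively throw information away.
runSplitter :: Splitter a -> [a] -> [[a]]
runSplitter s = post . chunks
  where
    chunks [] = [[]]
    chunks (x:xs)
      | isDelim s x = if keepDelims s
                        then [] : [x] : chunks xs
                        else [] : chunks xs
      | otherwise   = let (c:cs) = chunks xs in (x:c) : cs
    post = if keepBlanks s then id else filter (not . null)
```

For example, `runSplitter (defaultSplitter (== ',')) "a,,b"` yields `["a","","b"]`, while wrapping the splitter in `dropBlanks` yields `["a","b"]`.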

On Fri, 20 Jul 2012, Brent Yorgey wrote:
Use of GHC.Exts ---------------
At the request of a user, the 0.1.4.3 release switched from defining its own version of the standard 'build' function, to importing it from GHC.Exts. This allows GHC to do more optimization, resulting in reported speedups to uses of splitEvery, splitPlaces, and splitPlacesBlanks. However, this makes the library GHC-specific. If any reviewers think this is an issue I would be willing to go back to defining build by hand, or use CPP macros to select between build implementations based on the compiler.
You could provide two private modules with the same name in different directories, one that re-exports 'build' from GHC.Exts and one with a custom definition of 'build' for non-GHC compilers. Then set 'Hs-Source-Dirs' in Cabal according to impl(ghc). No CPP needed, only Cabal. One could even think of a separate package for this purpose.

The only type extension you use is GADTs, right? It looks like you use it for an Eq constraint in Delimiter/DelimSublist. That is, you actually need only ExistentialQuantification. Is it necessary?
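For illustration, this all-Cabal approach might look roughly like the following stanza; the directory and module names here are made up for the sketch, not taken from the package:

```cabal
library
  exposed-modules: Data.List.Split
  -- a private shim module defined, under the same name, in both
  -- compiler-specific directories
  other-modules:   Data.List.Split.Internals.Build
  if impl(ghc)
    -- the copy here re-exports 'build' from GHC.Exts
    hs-source-dirs: src, src-ghc
  else
    -- the copy here defines 'build' by hand
    hs-source-dirs: src, src-portable
```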

On Fri, 20 Jul 2012, Henning Thielemann wrote:
The only type extension you use, is GADTs, right? It looks like you use it for an Eq constraint in Delimiter/DelimSublist. That is, you actually need only ExistentialQuantification. Is it necessary?
Hm, this (Eq a) isn't even part of an existential quantification. So where is GADT needed?

On Fri, Jul 20, 2012 at 11:25:47PM +0200, Henning Thielemann wrote:
On Fri, 20 Jul 2012, Brent Yorgey wrote:
Use of GHC.Exts ---------------
At the request of a user, the 0.1.4.3 release switched from defining its own version of the standard 'build' function, to importing it from GHC.Exts. This allows GHC to do more optimization, resulting in reported speedups to uses of splitEvery, splitPlaces, and splitPlacesBlanks. However, this makes the library GHC-specific. If any reviewers think this is an issue I would be willing to go back to defining build by hand, or use CPP macros to select between build implementations based on the compiler.
You could provide two private modules with the same name in different directories, one that re-exports 'build' from GHC.Exts and one with a custom definition of 'build' for non-GHC compilers. Then set 'Hs-Source-Dirs' in Cabal according to the impl(ghc). No CPP needed, only Cabal. One could even think of a separate package for this purpose.
Ah, this is a good idea. I'd still like to hear from other reviewers whether they think it is worth the trouble. To what extent should the Haskell Platform try to be compiler-agnostic (even though it includes GHC)?
The only type extension you use, is GADTs, right? It looks like you use it for an Eq constraint in Delimiter/DelimSublist. That is, you actually need only ExistentialQuantification. Is it necessary?
You are right; actually, only ExistentialQuantification is necessary, as long as we also stop using GADT syntax. I didn't realize before that this syntax is accepted:

  {-# LANGUAGE ExistentialQuantification #-}

  data Delimiter a = DelimEltPred (a -> Bool)
                   | Eq a => DelimSublist [a]

I do agree that this is a bit weird; what's going on here is not exactly existential quantification. But in any case the ExistentialQuantification extension turns on this ability to embed class constraints in data constructors -- at least in GHC. I am happy to make this change, but I guess it again raises the issue of GHC-specificity.

As to whether there is a way to do this without using ExistentialQuantification, I don't see an obvious solution (though there probably is one). The issue is that we certainly don't want to require an (Eq a) constraint when using a predicate, but we need one when matching sublists. I'm open to suggestions.

-Brent

On Fri, 20 Jul 2012, Brent Yorgey wrote:
As to whether there is a way to do this without using ExistentialQuantification, I don't see an obvious solution (though there probably is one). The issue is that we certainly don't want to require an (Eq a) constraint when using a predicate, but we need one when matching sublists. I'm open to suggestions.
If you do not embed the Eq dictionary into DelimSublist then you have to add the Eq constraint to some functions. Is this a problem? In turn you would get a perfectly portable Haskell 98 module. See attached patch. You would still need to do a major version bump.

On Sat, Jul 21, 2012 at 09:34:13AM +0200, Henning Thielemann wrote:
On Fri, 20 Jul 2012, Brent Yorgey wrote:
As to whether there is a way to do this without using ExistentialQuantification, I don't see an obvious solution (though there probably is one). The issue is that we certainly don't want to require an (Eq a) constraint when using a predicate, but we need one when matching sublists. I'm open to suggestions.
If you do not embed the Eq dictionary into DelimSublist then you have to add the Eq constraint to some functions. Is this a problem? In turn you would get a perfectly portable Haskell 98 module.
No, I do not want to do this. The reason is that you would no longer be able to split lists of elements with no Eq instance, even if you are using a predicate. For example, it is currently possible to do

  fs  = [(+1), (subtract 7), (*6)]
  fs' = splitWhen (\f -> f 7 == 0) fs

but this would no longer be possible with your patch. This is an admittedly contrived example, but on principle I don't want to unnecessarily restrict the API in this way.

-Brent

On Sat, Jul 21, 2012 at 2:34 AM, Brent Yorgey
You are right, actually, only ExistentialQuantification is necessary, as long as we also stop using GADT syntax. I didn't realize before that this syntax is accepted:
{-# LANGUAGE ExistentialQuantification #-}
data Delimiter a = DelimEltPred (a -> Bool) | Eq a => DelimSublist [a]
I do agree that this is a bit weird, what's going on here is not exactly existential quantification. But in any case the ExistentialQuantification extension turns on this ability to embed class constraints in data constructors -- at least in GHC.
GADTs and ExistentialQuantification are pretty similar. There are only two differences:

- Syntax
- GADTs enable equality constraints, ExistentialQuantification does not

But if you have equality constraints from somewhere else (say, TypeFamilies) then ExistentialQuantification is equivalent to GADTs.

An unrelated suggestion: you can give type signatures to the various functions which are synonyms of each other as a group, and they will show up as a single item in the Haddocks. For example, instead of

  -- | some docs
  splitOn :: Eq a => [a] -> [a] -> [[a]]

  -- | some other docs
  sepBy :: Eq a => [a] -> [a] -> [[a]]

  -- | different docs
  unintercalate :: Eq a => [a] -> [a] -> [[a]]

you can have

  -- | one and only docs
  splitOn, sepBy, unintercalate :: Eq a => [a] -> [a] -> [[a]]

I don't know if you consider this an improvement. I think I do.

-- Your ship was caught in a monadic eruption.

An unrelated suggestion: you can give type signatures to the various functions which are synonyms of each other as a group and they will show up as a single item in the Haddocks.
Have you tried this with a recent version of Haddock? I think this was only in the Haddock version that was released with GHC 7.2 (introduced due to a change in GHC). But it had issues with explicit export lists, so we now always expand such signatures. Cheers, Simon

On Sat, Jul 21, 2012 at 2:13 PM, Simon Hengel
An unrelated suggestion: you can give type signatures to the various functions which are synonyms of each other as a group and they will show up as a single item in the Haddocks.
Have you tried this with a recent version of Haddock? I think this was only in the Haddock version that was released with GHC 7.2 (introduced due to a change in GHC). But it had issues with explicit export lists, so we now always expand such signatures.
Cheers, Simon
It seems to work with Haddock 2.10 / GHC 7.4. I remember I initially tried it with whatever old version of Haddock I had installed, and was annoyed that it didn't work: it got split into two declarations, with one of them missing documentation entirely -- quite suboptimal. But then, following a suggestion, I tried it again with GHC 7.4 and was happy to find that it had improved and gave the expected behaviour; impressively, it even worked across module boundaries. I would be sad if it stopped working again. :)

Some examples (which Hackage built using 7.4):

http://hackage.haskell.org/packages/archive/repa/3.2.1.1/doc/html/Data-Array... (append, (++))
http://hackage.haskell.org/packages/archive/type-eq/0.1.2/doc/html/Type-Eq.h... (cast, (|>), TypeEq(..))

-- Your ship was caught in a monadic eruption.

On Sat, Jul 21, 2012 at 04:37:48PM +0200, Gábor Lehel wrote:
On Sat, Jul 21, 2012 at 2:13 PM, Simon Hengel
wrote: An unrelated suggestion: you can give type signatures to the various functions which are synonyms of each other as a group and they will show up as a single item in the Haddocks.
Have you tried this with a recent version of Haddock? I think this was only in the Haddock version that was released with GHC 7.2 (introduced due to a change in GHC). But it had issues with explicit export lists, so we now always expand such signatures.
Cheers, Simon
It seems to work with Haddock 2.10 / GHC 7.4. I remember I initially tried it with whatever old version of Haddock I had installed and I was annoyed that it didn't work (it got split into two declarations with one of them missing documentation entirely -- quite suboptimal), but then following a suggestion I tried it again with GHC 7.4 and was happy that it seemed to have improved and gave the expected behaviour, and impressively even worked across module boundaries. I would be sad if it stopped working again. :)
Indeed, from the CHANGES file in the haddock distribution:

  Changes in version 2.9.3:

    * A type signature for multiple names generates one signature in the output

Neato, I didn't know about this feature. I think grouping synonyms together like this is a good idea.

-Brent

(CCing David Waern, Haddock's maintainer)

On Sat, Jul 21, 2012 at 04:37:48PM +0200, Gábor Lehel wrote:
On Sat, Jul 21, 2012 at 2:13 PM, Simon Hengel
wrote: An unrelated suggestion: you can give type signatures to the various functions which are synonyms of each other as a group and they will show up as a single item in the Haddocks.
Have you tried this with a recent version of Haddock? I think this was only in the Haddock version that was released with GHC 7.2 (introduced due to a change in GHC). But it had issues with explicit export lists, so we now always expand such signatures.
Cheers, Simon
It seems to work with Haddock 2.10 / GHC 7.4.
Oh, then the patch did not make it into Haddock 2.10 / GHC 7.4.1. But Haddock 2.11 (which comes with GHC 7.4.2) expands such signatures. The corresponding ticket is at [1].
I remember I initially tried it with whatever old version of Haddock I had installed and I was annoyed that it didn't work
AFAIK, prior to GHC 7.2, ghc did not retain that information in the AST (see [2]).
(it got split into two declarations with one of them missing documentation entirely -- quite suboptimal),
Maybe some solace: you now at least get the documentation on all the declarations ;)
I would be sad if it stopped working again. :)
Some examples (which Hackage built using 7.4): http://hackage.haskell.org/packages/archive/repa/3.2.1.1/doc/html/Data-Array... (append, (++))
Looks like a valid use case. I never thought about that, but if you have synonyms for a function, it really makes sense. (Personally, I do not really like synonyms, but for this particular case even that makes sense to me.) The issue was with stuff like:

  module Foo (
    -- * Foo
      foo
    -- * Bar
    , bar
    -- * Baz
    , baz
    ) where

  foo, bar, baz :: Int
  ...

Or what if you change the order of identifiers in the export list?

There is another case that would need special treatment. We now include deprecation messages for deprecated stuff in documentation. So for the following example the documentation for `foo` and `bar` would be different:

  -- | Documentation for `foo` and `bar`.
  foo, bar :: Int
  {-# DEPRECATED foo "use `bar` instead" #-}
  ...

Cheers, Simon

[1] http://trac.haskell.org/haddock/ticket/192
[2] http://hackage.haskell.org/trac/ghc/ticket/1595

On 2012-07-21 02:34, Brent Yorgey wrote:
You are right, actually, only ExistentialQuantification is necessary, as long as we also stop using GADT syntax. I didn't realize before that this syntax is accepted:
{-# LANGUAGE ExistentialQuantification #-}
data Delimiter a = DelimEltPred (a -> Bool) | Eq a => DelimSublist [a]
Would the following type work?

  data Delimiter a = DelimEltPred (a -> Bool)
                   | DelimSublistPred [a -> Bool]

You can go from the current DelimSublist to DelimSublistPred with just `map (==)`. And is the distinction between DelimEltPred and DelimSublistPred then still needed at all?

Twan

On 2012-07-21 14:30, Twan van Laarhoven wrote:
Would the following type work?
data Delimiter a = DelimEltPred (a -> Bool) | DelimSublistPred [a -> Bool]
You can go from the current DelimSublist to DelimSublistPred with just `map (==)`. And is the distinction between the DelimEltPred and DelimSublistPred then still needed at all?
I attached some code that uses the simplified Delimiter type. As an aside, the documentation for splitPlaces contains this unhelpful remark:
The behavior of splitPlaces ls xs when sum ls /= length xs can be inferred from the above examples and the fact that splitPlaces is total.
To me that reads like "the documentation of this function is left as an exercise to the reader". Perhaps say something like:

  If the input list is longer than the total of the given lengths, then the remaining elements are dropped. If the list is shorter than the total of the given lengths, then the result may contain fewer chunks, and the last chunk may be shorter.

While `splitPlacesBlanks` could say something like:

  If the input list is longer than the total of the given lengths, then the remaining elements are dropped. If the list is shorter than the total of the given lengths, then the last several chunks will be shorter or empty.

Twan
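For concreteness, here is a standalone sketch whose behaviour matches the suggested wording. These are illustrative reimplementations written for this message, not the package's actual code:

```haskell
-- Stops as soon as either the lengths or the input run out, so extra
-- input is dropped, and a short input yields fewer (and possibly
-- shorter) chunks.
splitPlaces :: [Int] -> [a] -> [[a]]
splitPlaces _      [] = []
splitPlaces []     _  = []
splitPlaces (n:ns) xs = take n xs : splitPlaces ns (drop n xs)

-- Produces one (possibly empty) chunk for every requested length,
-- even after the input is exhausted.
splitPlacesBlanks :: [Int] -> [a] -> [[a]]
splitPlacesBlanks []     _  = []
splitPlacesBlanks (n:ns) xs = take n xs : splitPlacesBlanks ns (drop n xs)
```

So `splitPlaces [2,3] [1..10]` gives `[[1,2],[3,4,5]]` (extra input dropped), `splitPlaces [2,3,4] [1..4]` gives `[[1,2],[3,4]]` (fewer chunks), and `splitPlacesBlanks [2,2,2] [1..3]` gives `[[1,2],[3],[]]`.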

On Sat, Jul 21, 2012 at 02:59:45PM +0200, Twan van Laarhoven wrote:
On 2012-07-21 14:30, Twan van Laarhoven wrote:
Would the following type work?
data Delimiter a = DelimEltPred (a -> Bool) | DelimSublistPred [a -> Bool]
You can go from the current DelimSublist to DelimSublistPred with just `map (==)`. And is the distinction between the DelimEltPred and DelimSublistPred then still needed at all?
I attached some code that uses the simplified Delimiter type.
Brilliant! Yes, this is a big improvement, and does indeed allow getting rid of the ExistentialQuantification (or GADTs) extension. I will make this change for sure, regardless of the outcome of the review process.
As an aside, the documentation for splitPlaces contains this unhelpful remark:
The behavior of splitPlaces ls xs when sum ls /= length xs can be inferred from the above examples and the fact that splitPlaces is total.
To me that reads like "the documentation of this function is left as an exercise to the reader". Perhaps say something like:
If the input list is longer than the total of the given lengths, then the remaining elements are dropped. If the list is shorter than the total of the given lengths, then the result may contain fewer chunks, and the last chunk may be shorter.
While `splitPlacesBlanks` could say something like:
If the input list is longer than the total of the given lengths, then the remaining elements are dropped. If the list is shorter than the total of the given lengths, then the last several chunks will be shorter or empty.
Ah, yes, I agree. I must have been in somewhat of a cheeky mood when I wrote that. I will improve the documentation, thanks for the suggestions. -Brent

Am 21.07.2012 14:30, schrieb Twan van Laarhoven:
Would the following type work?
data Delimiter a = DelimEltPred (a -> Bool) | DelimSublistPred [a -> Bool]
You can go from the current DelimSublist to DelimSublistPred with just `map (==)`. And is the distinction between the DelimEltPred and DelimSublistPred then still needed at all?
Does it not make sense then to reuse the DelimSublistPred variant with a singleton list for the DelimEltPred case? (I suppose empty lists must be excluded anyway.)

  newtype Delimiter a = Delimiter [a -> Bool]

C.
Twan
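A sketch of how this unified representation might be consumed. The smart constructors and matchDelim are hypothetical names invented here for illustration; they are not part of the package's API:

```haskell
-- One representation covers both cases: a delimiter is a list of
-- per-element predicates matched positionally.
newtype Delimiter a = Delimiter [a -> Bool]

-- The old single-predicate case becomes a singleton list.
delimEltPred :: (a -> Bool) -> Delimiter a
delimEltPred p = Delimiter [p]

-- The old literal-sublist case is recovered via `map (==)`;
-- note the Eq constraint lives only here, not in the type.
delimSublist :: Eq a => [a] -> Delimiter a
delimSublist = Delimiter . map (==)

-- Try to match the delimiter at the front of a list, returning the
-- matched portion and the remainder on success.
matchDelim :: Delimiter a -> [a] -> Maybe ([a], [a])
matchDelim (Delimiter [])     xs = Just ([], xs)
matchDelim (Delimiter _)      [] = Nothing
matchDelim (Delimiter (p:ps)) (x:xs)
  | p x       = do (m, rest) <- matchDelim (Delimiter ps) xs
                   Just (x : m, rest)
  | otherwise = Nothing
```

For instance, `matchDelim (delimSublist "ab") "abc"` gives `Just ("ab","c")`, while no Eq instance is needed to use `delimEltPred`.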

On Mon, Jul 23, 2012 at 11:11:19AM +0200, Christian Maeder wrote:
Am 21.07.2012 14:30, schrieb Twan van Laarhoven:
Would the following type work?
data Delimiter a = DelimEltPred (a -> Bool) | DelimSublistPred [a -> Bool]
You can go from the current DelimSublist to DelimSublistPred with just `map (==)`. And is the distinction between the DelimEltPred and DelimSublistPred then still needed at all?
Does it not make sense then, to reuse the DelimSublistPred variant with a singleton list for the DelimEltPred case? (I suppose empty lists must be excluded anyway).
newtype Delimiter a = Delimiter [a -> Bool]
Yes, in fact, Twan sent another message with some code using this exact definition of Delimiter, and I have already made this change to the library. -Brent

+1. People often reinvent splitting utilities because adding a dependency on a package for simple functions like these seems overkill. Hopefully, inclusion in the HP will help by making that overhead smaller.

Regarding the API: I'm a bit concerned with the presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn, etc.). IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.

-- Roman I. Cheplyaka :: http://ro-che.info/

On Sat, Jul 21, 2012 at 09:24:16AM +0300, Roman Cheplyaka wrote:
Regarding the API: I'm a bit concerned with presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn etc.) IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.
Would your concern be addressed by Gábor's suggestion to group synonyms together in the documentation? -Brent

* Brent Yorgey
On Sat, Jul 21, 2012 at 09:24:16AM +0300, Roman Cheplyaka wrote:
Regarding the API: I'm a bit concerned with presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn etc.) IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.
Would your concern be addressed by Gábor's suggestion to group synonyms together in the documentation?
This could be a minor improvement (I haven't checked how these grouped functions look in Haddock), but I don't see the need for synonyms in the first place. Could you maybe explain the motivation behind them?

-- Roman I. Cheplyaka :: http://ro-che.info/

On Sat, Jul 21, 2012 at 11:40:19PM +0300, Roman Cheplyaka wrote:
* Brent Yorgey
[2012-07-21 08:09:09-0400] On Sat, Jul 21, 2012 at 09:24:16AM +0300, Roman Cheplyaka wrote:
Regarding the API: I'm a bit concerned with presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn etc.) IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.
Would your concern be addressed by Gábor's suggestion to group synonyms together in the documentation?
This could be a minor improvement (I haven't checked how these grouped functions look in Haddock), but I don't see the need for synonyms in the first place. Could you maybe explain the motivation behind them?
Certainly. The idea is to provide synonyms whenever there are multiple common names in use, as well as a consistent system of names within the package itself. The goal is for new users to be able to get started using the library as quickly as possible -- users will usually come looking for some particular function, and they may already have an idea about what it might be called. To be concrete, the split package has three sets of synonyms:

* splitOn / sepBy / unintercalate

  Here 'splitOn' is an internally consistent name, which matches the naming scheme used in the rest of the package. 'sepBy' is a name from parsec and other parser combinator libraries; 'unintercalate' emphasizes that this function is a right inverse to 'intercalate'.

* splitOneOf / sepByOneOf

* splitEvery / chunk

  Again, 'splitEvery' matches the internal naming scheme; 'chunk' is a name commonly used for this function within the community.

I don't see much harm in this (modulo making the documentation clearer, which I plan to do). And I really don't want to *remove* existing names, because that would force a major version bump and potentially break any code depending on split.

-Brent

Firstly, I agree that adding splitting functionality to the Platform would be very useful -- so thanks for all the work you've done.

On the naming side of things, I'd just like to say I agree with Roman here: I think having synonyms in an API is a bad idea. (It's one of the things I've disliked most about Coq.) I think there might be an argument that synonyms improve things for those who know the API really well -- but that comes at the cost of worsening the experience for those (i.e. most) who know it moderately or fairly well and thus, when reading code written by others, find themselves struggling to remember whether the variants they don't use personally are subtly different or not.

Anyway, I'd be strongly in favour of removing the synonyms (or at least side-lining them into a separate not-exported-by-default module) before adding to the platform.

--Ben

P.S. I think "intercalate" is an awful name, and "unintercalate" is certainly no better ;-) -- so I'd be in favour of choosing one of the other two.

On 22 Jul 2012, at 01:43, Brent Yorgey wrote:
On Sat, Jul 21, 2012 at 11:40:19PM +0300, Roman Cheplyaka wrote:
* Brent Yorgey
[2012-07-21 08:09:09-0400] On Sat, Jul 21, 2012 at 09:24:16AM +0300, Roman Cheplyaka wrote:
Regarding the API: I'm a bit concerned with presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn etc.) IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.
Would your concern be addressed by Gábor's suggestion to group synonyms together in the documentation?
This could be a minor improvement (I haven't checked how these grouped functions look in Haddock), but I don't see the need for synonyms in the first place. Could you maybe explain the motivation behind them?
Certainly. The idea is to provide synonyms whenever there are multiple common names in use, as well as a consistent system of names within the package itself. The goal is for new users to be able to get started using the library as quickly as possible -- users will usually come looking for some particular function and they may already have an idea about what it might be called.
To be concrete, the split package has three sets of synonyms:
* splitOn / sepBy / unintercalate
Here 'splitOn' is an internally consistent name, which matches with the naming scheme used in the rest of the package. 'sepBy' is a name from parsec and other parser combinator libraries; 'unintercalate' emphasizes that this function is right inverse to 'intercalate'.
* splitOneOf / sepByOneOf
* splitEvery / chunk
Again, 'splitEvery' matches the internal naming scheme; 'chunk' is a name commonly used for this function within the community.
I don't see much harm in this (modulo making the documentation clearer, which I plan to do). And I really don't want to *remove* existing names because that would force a major version bump and potentially break any code depending on split.
-Brent
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Ben Moseley
On the naming side of things, I'd just like to say I agree with Roman here - I think having synonyms in an API is a bad idea.
Agreed
I think there might be an argument that synonyms improve things for those who know the API really well - but that comes at the cost of worsening the experience for those (i.e. most) who know it moderately or fairly well and thus - when reading code written by others - find themselves struggling to remember whether the variants that they don't use personally are subtly different or not.
Exactly. And it’s very easy for someone who wants a synonym to define it. There’s an issue in this case of avoiding breaking code, but including them, deprecating them and hiding them from the documentation seems to solve that.
P.S. I think "intercalate" is an awful name,
What have you got against it? For the record I was against introducing a name for such a short function (argument is similar to that against synonyms), but it does mean exactly the right thing.
and "unintercalate" is certainly no better ;-)
I’ll grant that that is very ugly. -- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

A suggestion: perhaps these synonyms should be added as Haddock comments. Thus someone Ctrl-F'ing would be able to find the canonical functions even if they don't know their name. However, someone Hayoo'ing wouldn't. Cheers, -- Felipe.
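Felipe's suggestion might look something like the following. The doc text is illustrative, and the implementation is a stand-in written for this sketch (the real splitOn lives in Data.List.Split); it assumes a non-empty delimiter:

```haskell
import Data.List (isPrefixOf)

-- | Split a list on every (non-overlapping) occurrence of the given
-- sublist.  Known elsewhere as @sepBy@ and @unintercalate@ -- naming
-- the synonyms here lets a reader Ctrl-F'ing for those names land on
-- the canonical function.
splitOn :: Eq a => [a] -> [a] -> [[a]]
splitOn delim = go
  where
    go xs = case breakOn xs of
      (pre, Nothing)   -> [pre]
      (pre, Just rest) -> pre : go rest
    -- break at the first occurrence of the delimiter, if any
    breakOn ys
      | delim `isPrefixOf` ys = ([], Just (drop (length delim) ys))
    breakOn []     = ([], Nothing)
    breakOn (y:ys) = let (pre, r) = breakOn ys in (y : pre, r)
```

With this, `splitOn "," "a,b,c"` gives `["a","b","c"]`, and the Hayoo-visible name stays unique while the synonyms remain findable on the page.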

Hi, not sure if that adds anything to the discussion, but I share Roman's concerns. On Sat, Jul 21, 2012 at 08:43:05PM -0400, Brent Yorgey wrote:
On Sat, Jul 21, 2012 at 11:40:19PM +0300, Roman Cheplyaka wrote:
* Brent Yorgey
[2012-07-21 08:09:09-0400] On Sat, Jul 21, 2012 at 09:24:16AM +0300, Roman Cheplyaka wrote:
Regarding the API: I'm a bit concerned with presence of synonyms there (e.g. chunk = splitEvery, sepBy = splitOn etc.) IMO it makes it harder to learn the API (which is not small even without the synonyms), and especially to read other people's code if their preferences in naming differ from yours.
Would your concern be addressed by Gábor's suggestion to group synonyms together in the documentation?
This could be a minor improvement (I haven't checked how these grouped functions look in Haddock), but I don't see the need for synonyms in the first place. Could you maybe explain the motivation behind them?
Certainly. The idea is to provide synonyms whenever there are multiple common names in use, as well as a consistent system of names within the package itself. The goal is for new users to be able to get started using the library as quickly as possible -- users will usually come looking for some particular function and they may already have an idea about what it might be called.
This was at least not true for me. When I first looked at split, I was utterly confused by all those synonyms. This was one of two reasons why I decided not to use it (having another dependency was the other).
To be concrete, the split package has three sets of synonyms:
* splitOn / sepBy / unintercalate
Here 'splitOn' is an internally consistent name, which matches with the naming scheme used in the rest of the package. 'sepBy' is a name from parsec and other parser combinator libraries; 'unintercalate' emphasizes that this function is right inverse to 'intercalate'.
* splitOneOf / sepByOneOf
* splitEvery / chunk
Again, 'splitEvery' matches the internal naming scheme; 'chunk' is a name commonly used for this function within the community.
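To illustrate, each set of synonyms denotes literally the same function (examples assume the split package as of the 0.1.4 series, with the synonyms still exported):

```haskell
import Data.List.Split

-- 'splitOn', 'sepBy', and 'unintercalate' all name one function:
--   splitOn       "," "a,b,c"  ==  ["a","b","c"]
--   sepBy         "," "a,b,c"  ==  ["a","b","c"]
--   unintercalate "," "a,b,c"  ==  ["a","b","c"]
--
-- likewise 'splitEvery' and 'chunk':
--   splitEvery 3 [1..7]  ==  [[1,2,3],[4,5,6],[7]]
--   chunk      3 [1..7]  ==  [[1,2,3],[4,5,6],[7]]
```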
I don't see much harm in this (modulo making the documentation clearer, which I plan to do).
Making the documentation clearer would mitigate Roman's first concern (learning the API). But it would not address the second concern (reading/understanding other people's code), which to me is even more important.
And I really don't want to *remove* existing names because that would force a major version bump and potentially break any code depending on split.
I think this particular point can be addressed either by deprecating the synonyms, or by removing the Haddock comments from the synonyms and using {-# OPTIONS_HADDOCK prune #-}. That way the synonyms do not show up in the documentation, but are still exported. Cheers, Simon
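Both approaches Simon mentions can be sketched as follows (the module name and definitions here are illustrative, not taken from the actual split source):

```haskell
{-# OPTIONS_HADDOCK prune #-}
-- With 'prune', exported names that carry no Haddock comment are
-- omitted from the generated documentation but remain exported.
module Example (splitOn, sepBy) where

-- | Split a list on a delimiter (documented, so it shows up).
splitOn :: Eq a => [a] -> [a] -> [[a]]
splitOn = undefined  -- placeholder; real definition elided

-- Alternatively, keep the synonym documented but mark it deprecated,
-- so users see a warning at compile time:
{-# DEPRECATED sepBy "use splitOn instead" #-}
sepBy :: Eq a => [a] -> [a] -> [[a]]
sepBy = splitOn
```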

I haven't reviewed the technical content of the proposal yet (beyond having used split in prior projects), but I generally approve of having a split-like package in the Haskell Platform. Cheers, Edward Excerpts from Brent Yorgey's message of Fri Jul 20 16:35:57 -0400 2012:
Design decisions ================
Most of the library is based around a (rather simple) combinator interface. Combinators are used to build up configuration records (recording options such as whether to keep delimiters, whether to keep blank segments, etc). A configuration record is finally handed off to a function which performs a generic maximally-information-preserving splitting algorithm and then does various postprocessing steps (based on the configuration) to selectively throw information away. It is probably not the fastest way to implement these methods, but speed is explicitly not a design goal: the aim is to provide a reasonably wide range of splitting strategies which can be used simply. Blazing speed (or more complex processing), when needed, can be obtained from a proper parsing package.
Open issues ===========
Use of GHC.Exts ---------------
At the request of a user, the 0.1.4.3 release switched from defining its own version of the standard 'build' function to importing it from GHC.Exts. This allows GHC to do more optimization, resulting in reported speedups for uses of splitEvery, splitPlaces, and splitPlacesBlanks. However, it also makes the library GHC-specific. If any reviewers think this is an issue, I would be willing to go back to defining build by hand, or to use CPP macros to select between build implementations based on the compiler.
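The CPP approach mentioned above might look something like this (a sketch, not the actual split source):

```haskell
{-# LANGUAGE CPP, RankNTypes #-}

#ifdef __GLASGOW_HASKELL__
-- Under GHC, import the real 'build' so foldr/build fusion can fire.
import GHC.Exts (build)
#else
-- Portable fallback: the standard definition of 'build', which
-- behaves identically but gets no fusion.
build :: forall a. (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []
#endif
```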
Missing strategies ------------------
The specific way that the generic splitting algorithm is implemented does preclude some imaginable splitting strategies. For example, a few years ago I tried adding a strategy that used a predicate on pairs of elements, splitting down the middle of any pairs that satisfy the predicate, but gave up because it simply did not fit.

On Fri, Jul 20, 2012 at 1:35 PM, Brent Yorgey wrote:
Everyone is invited to review this proposal, following the standard procedure [2] for proposing and reviewing packages.
My comments: I'm generally in favour of this, as the split library is quite nice. A couple of more specific notes follow.
Would you have any interest in abstracting this library over other particularly common list-like sequences? There's obviously a need for similar functionality with e.g. the text and bytestring libraries (maybe vector too, though I'm more dubious), and proliferating this API into those packages doesn't seem like the best way to go.
Speaking of the text library, it already provides a function named chunksOf that is identical to your splitEvery:
http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text...
For consistency with an existing HP library, I think you should rename splitEvery to chunksOf.
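For reference, the two functions being compared implement the same chunking strategy on their respective sequence types (examples assume text's Data.Text and split's Data.List.Split):

```haskell
import qualified Data.Text as T
import Data.List.Split (splitEvery)

-- Same behaviour, different sequence types:
--   splitEvery 3 "abcdefg"           ==  ["abc","def","g"]
--   T.chunksOf 3 (T.pack "abcdefg")  ==  map T.pack ["abc","def","g"]
```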

On Mon, Jul 23, 2012 at 09:22:56AM -0700, Bryan O'Sullivan wrote:
On Fri, Jul 20, 2012 at 1:35 PM, Brent Yorgey wrote:
Everyone is invited to review this proposal, following the standard procedure [2] for proposing and reviewing packages.
My comments:
I'm generally in favour of this, as the split library is quite nice. A couple of more specific notes follow.
Would you have any interest in abstracting this library over other particularly common list-like sequences? There's obviously a need for similar functionality with e.g. the text and bytestring libraries (maybe vector too, though I'm more dubious), and proliferating this API into those packages doesn't seem like the best way to go.
I don't really have the time or inclination to do that. But if someone else wanted to work on this, I would certainly not be against it.
Speaking of the text library, it already provides a function named chunksOf that is identical to your splitEvery:
http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text...
For consistency with an existing HP library, I think you should rename splitEvery to chunksOf.
*grumble grumble*. The first version of split was released three whole months before the first version of text, so how come text didn't use the existing standard name? ;-) ...OK, OK, fine. =) -Brent

On 20/07/12 22:35, Brent Yorgey wrote:
Hello everyone,
This is a proposal for the split package [1] to be included in the next major release of the Haskell platform.
Missing strategies ------------------
The specific way that the generic splitting algorithm is implemented does preclude some imaginable splitting strategies. For example, a few years ago I tried adding a strategy that used a predicate on pairs of elements, splitting down the middle of any pairs that satisfy the predicate, but gave up because it simply did not fit.
I just needed a function for splitting a list into two equal parts. This is something you often need for mergesort-like functions. Is that something that belongs in Data.List.Split? I came up with this code; I am sure that there are other possible implementations.

-- | Split a list into two equally sized parts.
-- If the number of elements in the list is odd,
-- then the first part will be one element longer.
splitMiddle :: [a] -> ([a],[a])
splitMiddle xs = let (_,ys,zs) = go 0 xs in (ys,zs)
  where
    -- note: 'go' should be made strict in its first argument
    go _ []         = (0,[],[])
    go i xxs@(x:xs)
      | i <= j    = (j+1, x:ys, zs)
      | otherwise = (j+1, ys, xxs)
      where (j,ys,zs) = go (i+1) xs

Twan

Twan van Laarhoven wrote:
I just needed a function for splitting a list into two equal parts. This is something you often need for MergeSort-like functions. Is that something that belongs in Data.List.Split?
If you don't care about the order of elements (which merge sort doesn't), then this will do the trick: foldr (\x (ys,zs) -> (zs,x:ys)) ([],[]). Roman
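Written out as a top-level definition (the name 'halve' is not Roman's, just a label for the sketch), the fold looks like:

```haskell
-- Split a list into two halves of (nearly) equal length, ignoring
-- element order: each step swaps the two accumulators, so consecutive
-- elements land in alternating halves.
halve :: [a] -> ([a], [a])
halve = foldr (\x (ys, zs) -> (zs, x : ys)) ([], [])

-- halve [1..5] == ([2,4], [1,3,5])
```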

On Tue, Jul 24, 2012 at 03:13:08PM +0200, Twan van Laarhoven wrote:
On 20/07/12 22:35, Brent Yorgey wrote:
Hello everyone,
This is a proposal for the split package [1] to be included in the next major release of the Haskell platform.
Missing strategies ------------------
The specific way that the generic splitting algorithm is implemented does preclude some imaginable splitting strategies. For example, a few years ago I tried adding a strategy that used a predicate on pairs of elements, splitting down the middle of any pairs that satisfy the predicate, but gave up because it simply did not fit.
I just needed a function for splitting a list into two equal parts. This is something you often need for MergeSort-like functions. Is that something that belongs in Data.List.Split?
Perhaps, though I think this sort of thing should probably be handled outside the HP review process. Patches welcome. =) -Brent
participants (13)
- Ben Moseley
- Brent Yorgey
- Bryan O'Sullivan
- Christian Maeder
- Edward Z. Yang
- Felipe Almeida Lessa
- Gábor Lehel
- Henning Thielemann
- Jon Fairbairn
- Roman Cheplyaka
- Roman Leshchinskiy
- Simon Hengel
- Twan van Laarhoven