Proposal: Add split and splitWith (trac #2048)

Hello Haskellers, An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30 Twan

Hi
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30
(+10) Agreement. This is long overdue! Thanks Neil

ndmitchell:
Hi
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30
(+10) Agreement. This is long overdue!
I'd like some QuickCheck properties, so we can avoid the unlines . lines /= id fiasco. Given that, I'd support this too. -- Don

Don Stewart wrote:
I'd like some QuickCheck properties, so we can avoid the unlines . lines /= id fiasco. Given that, I'd support this too.
What is a good place to put those? I just ran some obvious tests from ghci:
test (\x y -> intercalate [x::Int] (split x y) == y) ] OK, passed 100 tests.
test (\p xs -> all (not . any p) (splitWith p xs :: [[Int]])) ] OK, passed 100 tests.
for the unbeliever:
test (\x y -> splitWith (==x) y == split (x::Int) y) ] OK, passed 100 tests.
to show lazyness:
split 'a' $ concat $ repeat "abcdefgh" ] ["","bcdefgh","bcdefgh","bcdefgh",...
split 'z' $ concat $ repeat "abcdefgh" ] ["abcdefghabcdefghabcdefghabcdefgh...
Fiasco avoided :) Twan

Hi Twan, On Thu, Jan 17, 2008 at 12:00:37AM +0100, Twan van Laarhoven wrote:
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
One or the other should be changed so that these agree: *Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. [] although I couldn't say which is "right" OTTOMH... Thanks Ian

Ian Lynagh wrote:
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
I hadn't noticed. In my opinion the Data.List version is more consistent,
split 'a' "xx" == ["xx"] split 'a' "x" == ["x"] split 'a' "" == [""]
and
split 'a' "aa" == ["","",""] split 'a' "a" == ["",""] split 'a' "" == [""]
Versus:
B.split 'a' "xx" == ["xx"] B.split 'a' "x" == ["x"] B.split 'a' "" == [] B.split 'a' "aa" == ["","",""] B.split 'a' "a" == ["",""] B.split 'a' "" == []
Twan

Hi
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
I hadn't noticed. In my opinion the Data.List version is more consistent,
You can show you can build it up from an inductive argument, yes. But the Data.ByteString version probably matches what I'd expect to happen much more - just a gut feeling. Thanks Neil

ndmitchell:
Hi
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
I hadn't noticed. In my opinion the Data.List version is more consistent,
You can show you can build it up from an inductive argument, yes. But the Data.ByteString version probably matches what I'd expect to happen much more - just a gut feeling.
It's been a long time since I looked at bytestring's split, and it was never meant to be the last word on how to do this. Here are some of the properties it has, prop_splitsplitWith c xs = split c xs == splitWith (== c) xs prop_joinsplit c xs = intercalate [c] (split c xs) == xs Note here the funniness of lines, complicates the properties: prop_linessplit xs = lines xs == split '\n' xs ++ (if last xs == '\n' then [empty] else []) What properties relating to current List functoins does this split have? (I don't want to block the inclusion, just want to get a sense of how it fits into the existing code). -- Don

Don Stewart wrote:
Note here the funniness of lines, complicates the properties:
prop_linessplit xs = lines xs == split '\n' xs ++ (if last xs == '\n' then [empty] else [])
This property is wrong (i.e. for xs == "\n")
What properties relating to current List functoins does this split have?
Data.ByteString.Char8.split only behaves differently to the proposed Data.List.split function on empty input lists. (as Twan showed) last xs == '\n' ==> lines xs == Data.List.split '\n' (init xs) Christian

Christian Maeder wrote:
Don Stewart wrote:
Note here the funniness of lines, complicates the properties:
[...]
Data.ByteString.Char8.split only behaves differently to the proposed Data.List.split function on empty input lists. (as Twan showed)
last xs == '\n' ==> lines xs == Data.List.split '\n' (init xs)
This property would be wrong for Data.ByteString.Char8.split and "\n". If we ignore the empty input (which is trivial) the following remaining property case is: not (null xs) && last xs /= '\n' ==> Data.List.split '\n' xs = lines (xs ++ "\n") which is true also for Data.ByteString.Char8.split Christian

On Thu, 17 Jan 2008, Ian Lynagh wrote:
On Thu, Jan 17, 2008 at 12:00:37AM +0100, Twan van Laarhoven wrote:
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
List's split is more consistent. It always returns a non-empty list. The number of sub-lists is the number of matching elements plus 1.

On Thu, 17 Jan 2008, Henning Thielemann wrote:
On Thu, 17 Jan 2008, Ian Lynagh wrote:
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
List's split is more consistent. It always returns a non-empty list. The number of sub-lists is the number of matching elements plus 1.
For QuickCheck: 1 + length (filter p xs) == length (split p xs)

Ian Lynagh wrote:
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)! The documentation in both modules should contain the example! split c [] == [[]] I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function. Christian

On Thu, 2008-01-17 at 17:36 +0100, Christian Maeder wrote:
Ian Lynagh wrote:
One or the other should be changed so that these agree:
*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []
although I couldn't say which is "right" OTTOMH...
I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)!
The documentation in both modules should contain the example! split c [] == [[]]
I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function.
If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api. Duncan

On Thu, 17 Jan 2008, Duncan Coutts wrote:
On Thu, 2008-01-17 at 17:36 +0100, Christian Maeder wrote:
I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)!
The documentation in both modules should contain the example! split c [] == [[]]
I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function.
If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api.
I vote for the change.

Duncan Coutts wrote:
If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api.
Good, lines can be best shortly expressed in terms of split as follows: mylines s = if null (last l) then init l else l where l = split '\n' s This is safe (though inefficient), because split always returns a non-empty list (in contrast to Data.ByteString.Char8.split)! Furthermore Data.PackedString.splitPS corresponds to lines. Below are some test case. Cheers Christian import qualified Data.ByteString.Char8 as B import qualified Data.PackedString as P *GHCI> l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (intercalate "\n" . split '\n') l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (B.intercalate (B.pack "\n") . B.split '\n' . B.pack) l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (unlines . lines) l ["","a\n","\n","aa\n","a\n","\na\n","\n\n"] *GHCI> map lines l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (lines . unlines . lines) l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map mylines l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (P.splitPS '\n' . P.packString) l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (B.split '\n' . B.pack) l [[],["a"],["",""],["aa"],["a",""],["","a"],["","",""]] *GHCI> map (split '\n') l [[""],["a"],["",""],["aa"],["a",""],["","a"],["","",""]]

On Thu, 17 Jan 2008, Twan van Laarhoven wrote:
Hello Haskellers,
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30
The type could be more specific. We know that the result list is always non-empty, and that the first element of each sub-list matches 'p', except the leading sub-list. Ideally we had a list type for elements of alternating type, like this one: http://darcs.haskell.org/event-list/src/Data/AlternatingList/List/Uniform.hs Then we could use the signature: splitWith :: (a -> Bool) -> [a] -> AlternatingList.T [a] a We can simulate this type by: splitWith :: (a -> Bool) -> [a] -> ([a], [(a, [a])])

On Thu, 17 Jan 2008, Henning Thielemann wrote:
On Thu, 17 Jan 2008, Twan van Laarhoven wrote:
Hello Haskellers,
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30
The type could be more specific. We know that the result list is always non-empty, and that the first element of each sub-list matches 'p', except the leading sub-list. Ideally we had a list type for elements of alternating type, like this one: http://darcs.haskell.org/event-list/src/Data/AlternatingList/List/Uniform.hs Then we could use the signature: splitWith :: (a -> Bool) -> [a] -> AlternatingList.T [a] a
We can simulate this type by: splitWith :: (a -> Bool) -> [a] -> ([a], [(a, [a])])
Ah, I see that your function filters out the elements, that match 'p'. This simplifies the resulting list type ...

Hello Twan, Thursday, January 17, 2008, 2:00:37 AM, you wrote:
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
one more proposal to ruin all the programs that define this function themselves. i wonder whether all who propose to add function or two ever seen MissingH package? it includes a lot of such popular functions and anyone who needs them can install this package (or just borrow code) i make a contra-proposal - fix base library in 6.8 state and add to ghc distribution new libs with all the popular Monad, List and any other functions. this will allow: 1) precisely control which functions are available by importing exact versions of all libs (except of base which anyway will be frozen) 2) use all the new functions regardless of ghc version you are using. otherwise, adding `split` function to the base actually means that noone except for core hackers can use it for a year - because all these changes in base will go in production GHC version at the end of 2008 and because using user-defined split function will automatically make program incompatible with next GHC version overall, i want to use GHC for production, open-source programming and can formalize requirements which will allow to do this: 1) frozen base library interface, except for GHC.* modules 2) if we want to improve some base code, we *duplicate* it into new lib (with modified module names), publish first version of this libarry with exactly original code (and therefore equivalent interfaces) and then start to improve it, publishing newer and newer versions. imagine, for example, that we want to improve Data.Array.*: step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1 step 2: raise NewArray version to 2.0 and start making changes. once we've finished, raise version to 3.0 and keep interfaces for 3.* intact so anyone can import NewArray 3.* and got latest version with exactly the same interface as he used yes, this means that every functionality improved against the base package, should be installed in two versions - one inside base one in new package. but this is very natural taking into account that we can't change base without breaking all the code that relies on it. so, if we want new arrays, Handles or Exceptions - we will need to keep old version in base and add new one in other lib. overall, it should be recommended to not import directly anything from base but use separate libraries instead 3) GHC distribution should include all the popular libs (which, at the last end, should terminate rushes to include "popular functions" into the base!) with *MULTIPLE* versions - i.e. last 1.* version of NewArray, last 2.* version and so on. this will ensure that program developed in year 2007, will continue to compile with newest ghc versions in 2008, 2009 and so on. we can drop library from ghc distribution after, say, 3 years. i also propose that Haskell' ccommittee will decide every year which libs to include into Haskell standard libs set with exact major version. for example: year 2007: BS 1.*, Collections 2.*, HDBC 1.* year 2008: BS 2.*, Collections 2.*, HSQL 1.* year 2009: BS 2.*, Collections 2.*, HSQL 2.* ghc-2009 should include std libs from last 3 years, i.e. BS 1.*/2.*, Collections 2.*, HDBC 1.*, HSQL 1.*/2.* ghc-2010 may drop BS 1.* and HDBC 1.* support and of course should add newer libs from HL-2010 standard. the same should do other haskell compilers. this will significantly improve situation with Haskell standard libraries: 1) Haskell' committee will not need to develop artificial "standard libraries" set - actually, i think it can't and anyway any fixed set will become obsolete next year. libraries are most important part of any language and nowadays we can't develop proper stdlibs set "by committee". it should be made by community and committee should only "sign up" final results 2) every haskell distribution will include some guaranteed minimum of common, up-to-date libraries. any program written using H2009 specifications, will continue to run with any major Haskell compilers for a next 3 years. this should overcome "libraries hell" for mid-sized apps 3) any book or courses teaching Haskell can declare, for example, that it investigates Haskell-2009 and any Haskell2009-compatible compiler will provide both the syntax and libs discussed in the book. it will also mean that when you hire "Haskell2009-certified" specialist, you will be sure that he knows not only the language itself but also basic set of libs, equivalent of STL for C++ defining large standard set of libs was the major source of success for Java/C#/C++ last decade. we can go one step further and join development by community with standardization by committee. it should make Haskell better solution for developing large, long-standing products - by providing RICH set of MODERN STANDARD libs, which are guaranteed to run across compilers and years. criteria of inclusion library in this set are (obvious): 1) popularity (this can be fairly measured by downloads/installations/ user votes) 2) open-source, free license, unix/win and ghc/hugs compatibility -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
i make a contra-proposal - fix base library in 6.8 state and add to ghc distribution new libs with all the popular Monad, List and any other functions. this will allow:
2) if we want to improve some base code, we *duplicate* it into new lib (with modified module names), publish first version of this libarry with exactly original code (and therefore equivalent interfaces) and then start to improve it, publishing newer and newer versions. imagine, for example, that we want to improve Data.Array.*:
step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1
Data.Array is in the external 'array' package now. -- Don

Hi Bulat,
On Jan 17, 2008 6:18 PM, Bulat Ziganshin
step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1
Having words like "new" for the purpose of versioning is quite confusing because a library which is new at some point will eventually become old and then the name is misleading. Versioning doesn't belong in module/function names IMHO. -- Johan

Twan van Laarhoven wrote:
Hello Haskellers,
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)
splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs
trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30
Those interested in contributing to this discussion might like to review some of the past threads on this topic :-) http://www.haskell.org/pipermail/libraries/2006-July/005504.html http://www.haskell.org/pipermail/libraries/2006-July/005499.html http://www.haskell.org/pipermail/libraries/2006-July/005525.html http://www.haskell.org/pipermail/libraries/2006-October/006072.html and I'm sure there are earlier threads, but the above probably covers enough of the design space. Cheers, Simon

From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On Behalf Of Simon Marlow
Twan van Laarhoven wrote:
Hello Haskellers,
An often requested function is 'split', to split a list into parts
Those interested in contributing to this discussion might like to review some of the past threads on this topic :-)
There is also this wiki page: http://haskell.org/haskellwiki/List_function_suggestions (which also contains links to the same discussions that Simon referenced.) Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************
participants (11)
-
Bayley, Alistair
-
Bulat Ziganshin
-
Christian Maeder
-
Don Stewart
-
Duncan Coutts
-
Henning Thielemann
-
Ian Lynagh
-
Johan Tibell
-
Neil Mitchell
-
Simon Marlow
-
Twan van Laarhoven