Proposal: Add split and splitWith (trac #2048)

Twan van Laarhoven

16 Jan 2008 16 Jan '08

11 p.m.

Hello Haskellers, An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...

split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30 Twan

Show replies by date

Neil Mitchell

16 Jan 16 Jan

11:29 p.m.

...

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30

(+10) Agreement. This is long overdue! Thanks Neil

Don Stewart

11:32 p.m.

ndmitchell:

...

Hi

...
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30

(+10) Agreement. This is long overdue!

I'd like some QuickCheck properties, so we can avoid the unlines . lines /= id fiasco. Given that, I'd support this too. -- Don

Twan van Laarhoven

11:50 p.m.

Don Stewart wrote:

...

I'd like some QuickCheck properties, so we can avoid the unlines . lines /= id fiasco. Given that, I'd support this too.

What is a good place to put those? I just ran some obvious tests from ghci:

...

test (\x y -> intercalate [x::Int] (split x y) == y) ] OK, passed 100 tests.

...

test (\p xs -> all (not . any p) (splitWith p xs :: [[Int]])) ] OK, passed 100 tests.

for the unbeliever:

...

test (\x y -> splitWith (==x) y == split (x::Int) y) ] OK, passed 100 tests.

to show lazyness:

...

split 'a' $ concat $ repeat "abcdefgh" ] ["","bcdefgh","bcdefgh","bcdefgh",...

...

split 'z' $ concat $ repeat "abcdefgh" ] ["abcdefghabcdefghabcdefghabcdefgh...

Fiasco avoided :) Twan

Ian Lynagh

17 Jan 17 Jan

midnight

Hi Twan, On Thu, Jan 17, 2008 at 12:00:37AM +0100, Twan van Laarhoven wrote:

...

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

One or the other should be changed so that these agree: *Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. [] although I couldn't say which is "right" OTTOMH... Thanks Ian

Twan van Laarhoven

12:10 a.m.

Ian Lynagh wrote:

...

One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

I hadn't noticed. In my opinion the Data.List version is more consistent,

...

split 'a' "xx" == ["xx"] split 'a' "x" == ["x"] split 'a' "" == [""]

and

...

split 'a' "aa" == ["","",""] split 'a' "a" == ["",""] split 'a' "" == [""]

Versus:

...

B.split 'a' "xx" == ["xx"] B.split 'a' "x" == ["x"] B.split 'a' "" == [] B.split 'a' "aa" == ["","",""] B.split 'a' "a" == ["",""] B.split 'a' "" == []

Twan

Neil Mitchell

12:32 a.m.

...

...
One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

I hadn't noticed. In my opinion the Data.List version is more consistent,

You can show you can build it up from an inductive argument, yes. But the Data.ByteString version probably matches what I'd expect to happen much more - just a gut feeling. Thanks Neil

Don Stewart

12:37 a.m.

ndmitchell:

...

Hi

...
...
One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

I hadn't noticed. In my opinion the Data.List version is more consistent,

You can show you can build it up from an inductive argument, yes. But the Data.ByteString version probably matches what I'd expect to happen much more - just a gut feeling.

It's been a long time since I looked at bytestring's split, and it was never meant to be the last word on how to do this. Here are some of the properties it has, prop_splitsplitWith c xs = split c xs == splitWith (== c) xs prop_joinsplit c xs = intercalate [c] (split c xs) == xs Note here the funniness of lines, complicates the properties: prop_linessplit xs = lines xs == split '\n' xs ++ (if last xs == '\n' then [empty] else []) What properties relating to current List functoins does this split have? (I don't want to block the inclusion, just want to get a sense of how it fits into the existing code). -- Don

Christian Maeder

3:31 p.m.

Don Stewart wrote:

...

Note here the funniness of lines, complicates the properties:

prop_linessplit xs = lines xs == split '\n' xs ++ (if last xs == '\n' then [empty] else [])

This property is wrong (i.e. for xs == "\n")

...

What properties relating to current List functoins does this split have?

Data.ByteString.Char8.split only behaves differently to the proposed Data.List.split function on empty input lists. (as Twan showed) last xs == '\n' ==> lines xs == Data.List.split '\n' (init xs) Christian

Christian Maeder

3:51 p.m.

Christian Maeder wrote:

...

Don Stewart wrote:

...
Note here the funniness of lines, complicates the properties:

[...]

...

Data.ByteString.Char8.split only behaves differently to the proposed Data.List.split function on empty input lists. (as Twan showed)

last xs == '\n' ==> lines xs == Data.List.split '\n' (init xs)

This property would be wrong for Data.ByteString.Char8.split and "\n". If we ignore the empty input (which is trivial) the following remaining property case is: not (null xs) && last xs /= '\n' ==> Data.List.split '\n' xs = lines (xs ++ "\n") which is true also for Data.ByteString.Char8.split Christian

Henning Thielemann

6:53 a.m.

On Thu, 17 Jan 2008, Ian Lynagh wrote:

...

On Thu, Jan 17, 2008 at 12:00:37AM +0100, Twan van Laarhoven wrote:

...
An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

List's split is more consistent. It always returns a non-empty list. The number of sub-lists is the number of matching elements plus 1.

Henning Thielemann

6:54 a.m.

On Thu, 17 Jan 2008, Henning Thielemann wrote:

...

On Thu, 17 Jan 2008, Ian Lynagh wrote:

...
One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

List's split is more consistent. It always returns a non-empty list. The number of sub-lists is the number of matching elements plus 1.

For QuickCheck: 1 + length (filter p xs) == length (split p xs)

Christian Maeder

4:36 p.m.

Ian Lynagh wrote:

...

One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)! The documentation in both modules should contain the example! split c [] == [[]] I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function. Christian

Duncan Coutts

10:28 p.m.

On Thu, 2008-01-17 at 17:36 +0100, Christian Maeder wrote:

...

Ian Lynagh wrote:

...
One or the other should be changed so that these agree:

*Main> split 'a' "" [""] *Main> Data.ByteString.Char8.split 'a' (Data.ByteString.Char8.pack "") Loading package array-0.1.0.0 ... linking ... done. Loading package bytestring-0.9.0.1 ... linking ... done. []

although I couldn't say which is "right" OTTOMH...

I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)!

The documentation in both modules should contain the example! split c [] == [[]]

I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function.

If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api. Duncan

Henning Thielemann

18 Jan 18 Jan

7:07 a.m.

On Thu, 17 Jan 2008, Duncan Coutts wrote:

...

On Thu, 2008-01-17 at 17:36 +0100, Christian Maeder wrote:

...
I've convinced myself (see my previous mails) that Twan's version is right and Data.ByteString.Char8.split should be changed (and all its usages need to be checked)!

The documentation in both modules should contain the example! split c [] == [[]]

I don't think that the oddity of Data.ByteString.Char8.split on empty input helps to establish the treatment of a missing final newline for the lines function.

If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api.

I vote for the change.

Christian Maeder

9:48 a.m.

Duncan Coutts wrote:

...

If everyone thinks this is the right behaviour we can certainly change Data.ByteString to follow. It is supposed to follow the Data.List api.

Good, lines can be best shortly expressed in terms of split as follows: mylines s = if null (last l) then init l else l where l = split '\n' s This is safe (though inefficient), because split always returns a non-empty list (in contrast to Data.ByteString.Char8.split)! Furthermore Data.PackedString.splitPS corresponds to lines. Below are some test case. Cheers Christian import qualified Data.ByteString.Char8 as B import qualified Data.PackedString as P *GHCI> l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (intercalate "\n" . split '\n') l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (B.intercalate (B.pack "\n") . B.split '\n' . B.pack) l ["","a","\n","aa","a\n","\na","\n\n"] *GHCI> map (unlines . lines) l ["","a\n","\n","aa\n","a\n","\na\n","\n\n"] *GHCI> map lines l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (lines . unlines . lines) l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map mylines l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (P.splitPS '\n' . P.packString) l [[],["a"],[""],["aa"],["a"],["","a"],["",""]] *GHCI> map (B.split '\n' . B.pack) l [[],["a"],["",""],["aa"],["a",""],["","a"],["","",""]] *GHCI> map (split '\n') l [[""],["a"],["",""],["aa"],["a",""],["","a"],["","",""]]

Henning Thielemann

17 Jan 17 Jan

6:43 a.m.

On Thu, 17 Jan 2008, Twan van Laarhoven wrote:

...

Hello Haskellers,

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30

The type could be more specific. We know that the result list is always non-empty, and that the first element of each sub-list matches 'p', except the leading sub-list. Ideally we had a list type for elements of alternating type, like this one: http://darcs.haskell.org/event-list/src/Data/AlternatingList/List/Uniform.hs Then we could use the signature: splitWith :: (a -> Bool) -> [a] -> AlternatingList.T [a] a We can simulate this type by: splitWith :: (a -> Bool) -> [a] -> ([a], [(a, [a])])

Henning Thielemann

6:45 a.m.

On Thu, 17 Jan 2008, Henning Thielemann wrote:

...

On Thu, 17 Jan 2008, Twan van Laarhoven wrote:

...
Hello Haskellers,

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30

The type could be more specific. We know that the result list is always non-empty, and that the first element of each sub-list matches 'p', except the leading sub-list. Ideally we had a list type for elements of alternating type, like this one: http://darcs.haskell.org/event-list/src/Data/AlternatingList/List/Uniform.hs Then we could use the signature: splitWith :: (a -> Bool) -> [a] -> AlternatingList.T [a] a

We can simulate this type by: splitWith :: (a -> Bool) -> [a] -> ([a], [(a, [a])])

Ah, I see that your function filters out the elements, that match 'p'. This simplifies the resulting list type ...

Bulat Ziganshin

5:18 p.m.

New subject: Proposal: hands off the base! :)

Hello Twan, Thursday, January 17, 2008, 2:00:37 AM, you wrote:

...

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

one more proposal to ruin all the programs that define this function themselves. i wonder whether all who propose to add function or two ever seen MissingH package? it includes a lot of such popular functions and anyone who needs them can install this package (or just borrow code) i make a contra-proposal - fix base library in 6.8 state and add to ghc distribution new libs with all the popular Monad, List and any other functions. this will allow: 1) precisely control which functions are available by importing exact versions of all libs (except of base which anyway will be frozen) 2) use all the new functions regardless of ghc version you are using. otherwise, adding `split` function to the base actually means that noone except for core hackers can use it for a year - because all these changes in base will go in production GHC version at the end of 2008 and because using user-defined split function will automatically make program incompatible with next GHC version overall, i want to use GHC for production, open-source programming and can formalize requirements which will allow to do this: 1) frozen base library interface, except for GHC.* modules 2) if we want to improve some base code, we *duplicate* it into new lib (with modified module names), publish first version of this libarry with exactly original code (and therefore equivalent interfaces) and then start to improve it, publishing newer and newer versions. imagine, for example, that we want to improve Data.Array.*: step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1 step 2: raise NewArray version to 2.0 and start making changes. once we've finished, raise version to 3.0 and keep interfaces for 3.* intact so anyone can import NewArray 3.* and got latest version with exactly the same interface as he used yes, this means that every functionality improved against the base package, should be installed in two versions - one inside base one in new package. but this is very natural taking into account that we can't change base without breaking all the code that relies on it. so, if we want new arrays, Handles or Exceptions - we will need to keep old version in base and add new one in other lib. overall, it should be recommended to not import directly anything from base but use separate libraries instead 3) GHC distribution should include all the popular libs (which, at the last end, should terminate rushes to include "popular functions" into the base!) with *MULTIPLE* versions - i.e. last 1.* version of NewArray, last 2.* version and so on. this will ensure that program developed in year 2007, will continue to compile with newest ghc versions in 2008, 2009 and so on. we can drop library from ghc distribution after, say, 3 years. i also propose that Haskell' ccommittee will decide every year which libs to include into Haskell standard libs set with exact major version. for example: year 2007: BS 1.*, Collections 2.*, HDBC 1.* year 2008: BS 2.*, Collections 2.*, HSQL 1.* year 2009: BS 2.*, Collections 2.*, HSQL 2.* ghc-2009 should include std libs from last 3 years, i.e. BS 1.*/2.*, Collections 2.*, HDBC 1.*, HSQL 1.*/2.* ghc-2010 may drop BS 1.* and HDBC 1.* support and of course should add newer libs from HL-2010 standard. the same should do other haskell compilers. this will significantly improve situation with Haskell standard libraries: 1) Haskell' committee will not need to develop artificial "standard libraries" set - actually, i think it can't and anyway any fixed set will become obsolete next year. libraries are most important part of any language and nowadays we can't develop proper stdlibs set "by committee". it should be made by community and committee should only "sign up" final results 2) every haskell distribution will include some guaranteed minimum of common, up-to-date libraries. any program written using H2009 specifications, will continue to run with any major Haskell compilers for a next 3 years. this should overcome "libraries hell" for mid-sized apps 3) any book or courses teaching Haskell can declare, for example, that it investigates Haskell-2009 and any Haskell2009-compatible compiler will provide both the syntax and libs discussed in the book. it will also mean that when you hire "Haskell2009-certified" specialist, you will be sure that he knows not only the language itself but also basic set of libs, equivalent of STL for C++ defining large standard set of libs was the major source of success for Java/C#/C++ last decade. we can go one step further and join development by community with standardization by committee. it should make Haskell better solution for developing large, long-standing products - by providing RICH set of MODERN STANDARD libs, which are guaranteed to run across compilers and years. criteria of inclusion library in this set are (obvious): 1) popularity (this can be fairly measured by downloads/installations/ user votes) 2) open-source, free license, unix/win and ghc/hugs compatibility -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Don Stewart

7:15 p.m.

New subject: Proposal: hands off the base! :)

bulat.ziganshin:

...

i make a contra-proposal - fix base library in 6.8 state and add to ghc distribution new libs with all the popular Monad, List and any other functions. this will allow:

2) if we want to improve some base code, we *duplicate* it into new lib (with modified module names), publish first version of this libarry with exactly original code (and therefore equivalent interfaces) and then start to improve it, publishing newer and newer versions. imagine, for example, that we want to improve Data.Array.*:

step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1

Data.Array is in the external 'array' package now. -- Don

Johan Tibell

21 Jan 21 Jan

9:13 a.m.

New subject: Proposal: hands off the base! :)

Hi Bulat, On Jan 17, 2008 6:18 PM, Bulat Ziganshin wrote:

...

step 1: create library NewArray with modules Data.NewArray.* copied one-to-one from Data.Array.* and publish it as version 1

Having words like "new" for the purpose of versioning is quite confusing because a library which is new at some point will eventually become old and then the name is misleading. Versioning doesn't belong in module/function names IMHO. -- Johan

Simon Marlow

18 Jan 18 Jan

1:28 p.m.

Twan van Laarhoven wrote:

...

Hello Haskellers,

An often requested function is 'split', to split a list into parts delimited by some separator. ByteString has the functions split and splitWith for this purpose. I propose we add equivalents to Data.List:

...
split :: Eq a => a -> [a] -> [[a]] split x = splitWith (x==)

splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p xs = ys : case zs of [] -> [] _:ws -> splitWith p ws where (ys,zs) = break p xs

trac: http://hackage.haskell.org/trac/ghc/ticket/2048 deadline: two weeks from now, January 30

Those interested in contributing to this discussion might like to review some of the past threads on this topic :-) http://www.haskell.org/pipermail/libraries/2006-July/005504.html http://www.haskell.org/pipermail/libraries/2006-July/005499.html http://www.haskell.org/pipermail/libraries/2006-July/005525.html http://www.haskell.org/pipermail/libraries/2006-October/006072.html and I'm sure there are earlier threads, but the above probably covers enough of the design space. Cheers, Simon

Bayley, Alistair

1:47 p.m.

...

From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On Behalf Of Simon Marlow

Twan van Laarhoven wrote:

...
Hello Haskellers,

An often requested function is 'split', to split a list into parts

Those interested in contributing to this discussion might like to review some of the past threads on this topic :-)

There is also this wiki page: http://haskell.org/haskellwiki/List_function_suggestions (which also contains links to the same discussions that Simon referenced.) Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

6371

Age (days ago)

6376

Last active (days ago)

List overview

Download

22 comments

11 participants

participants (11)

Bayley, Alistair
Bulat Ziganshin
Christian Maeder
Don Stewart
Duncan Coutts
Henning Thielemann
Ian Lynagh
Johan Tibell
Neil Mitchell
Simon Marlow
Twan van Laarhoven