Re: Why is there no splitBy in the list module?

marco-oweber:
There is already lines. Why not generalise it to take an additional parameter '\n' and call it split or splitBy? There are some cases where you want to split a list not on '\n'.
This comes up a lot. I thought at some time last year there'd been a long discussion about all the various flavours of 'split' that we could come up with. But then, what happened to the code? Did we not agree on a version to go into Data.List? Hacking up your own custom split (or a tokens/splitOnGlue) must be one of the most common questions from beginners on the irc channel. Even fps has a split/splitBy: http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3Asplit http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3AsplitWith http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3Ajoin Anyone rememeber what the result of the "let's get split into the base library" movement's work was? -- Don (Moved to haskell-cafe@)

Hello Donald, Monday, July 10, 2006, 11:48:48 AM, you wrote:
Anyone rememeber what the result of the "let's get split into the base library" movement's work was?
it will be even better to ask permission from John Goerzen and move to Data.List the whole string&list beastiary from MissingH -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Donald Bruce Stewart wrote:
Hacking up your own custom split (or a tokens/splitOnGlue) must be one of the most common questions from beginners on the irc channel.
Even fps has a split/splitBy: http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3Asplit http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3AsplitWith http://www.cse.unsw.edu.au/~dons/fps/Data-ByteString.html#v%3Ajoin
Anyone rememeber what the result of the "let's get split into the base library" movement's work was?
ISTR there wasn't a concensus, so nothing happened. Which is silly, really - I agree we should definitely have a Data.List.split. Cheers, Simon

Simon Marlow schrieb:
Donald Bruce Stewart wrote:
Hacking up your own custom split (or a tokens/splitOnGlue) must be one of the most common questions from beginners on the irc channel.
Anyone rememeber what the result of the "let's get split into the base library" movement's work was?
ISTR there wasn't a concensus, so nothing happened. Which is silly, really - I agree we should definitely have a Data.List.split.
Maybe someone can extract a result from the old discussion about "Prelude function suggestions" http://thread.gmane.org/gmane.comp.lang.haskell.libraries/1684/focus=1684 Cheers Christian Our current (special) version is: {- | A function inspired by the perl function split. A list is splitted on a seperator element in smaller non-empty lists. The seperator element is dropped from the resulting list. -} splitOn :: Eq a => a -- ^ seperator -> [a] -- ^ list to split -> [[a]] splitOn x xs = let (l, r) = break (==x) xs in (if null l then [] else [l]) ++ (if null r then [] else splitOn x $ tail r)

On Mon, Jul 10, 2006 at 02:26:23PM +0200, Christian Maeder wrote:
Our current (special) version is:
{- | A function inspired by the perl function split. A list is splitted on a seperator element in smaller non-empty lists. The seperator element is dropped from the resulting list. -} splitOn :: Eq a => a -- ^ seperator -> [a] -- ^ list to split -> [[a]] splitOn x xs = let (l, r) = break (==x) xs in (if null l then [] else [l]) ++ (if null r then [] else splitOn x $ tail r)
just a note: I find splitBy a much nicer routine to provide.
splitBy :: (a -> Bool) -- ^ whether char is a seperator -> [a] -- ^ list to split -> [[a]]
John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham schrieb:
just a note: I find splitBy a much nicer routine to provide.
I would support this, if it helped to find a consensus. It's more difficult to decide if empty lists as elements of the result list (at the beginning, at the end or in the middle) should be returned. I would say yes, because removing is quite easy afterwards. The problem typically occurs with a final newline, that may produce an empty last line (that again may be closed by a further final newline by mistake) Christian
splitBy :: (a -> Bool) -- ^ whether element is a seperator -> [a] -- ^ list to split -> [[a]]
P.S. inspecting more than one element looks like an over-generalization to me and should be left to parsers or regexp libs.

splitBy :: (a -> Bool) -- ^ whether element is a seperator -> [a] -- ^ list to split -> [[a]]
P.S. inspecting more than one element looks like an over-generalization to me and should be left to parsers or regexp libs.
It's more generally useful if you don't drop the separators from the output: splitSepWith f = map (dropWhile f) . splitWith f spaces = splitSepWith Char.isSpace But this still won't let you split on comma and spaces. Either regexes, or pass in a [tok] -> ([conumed], [rest]) type parser: splitWith :: ([a] -> ([a], [a])) -> [a] -> [[a]] ... but why not make it take parsec parsers and put it in a parsec util module or something (if it isn't already there!): splitWith (Parsec.char ',' >> Parsec.spaces) ... of course, then you might ask why not use Parsec.sepBy :) but maybe the "split on elt" concept is easier to learn than "write a whole parser". I guess the problem with the splitWith thing is that it's a slippery path that leads right up to full-on parsers. Python gets by nicely with a split on 1 or more spaces and a split on a constant string, so maybe those are a good compromise between generality and simplicity.

Evan Laforge wrote:
splitBy :: (a -> Bool) -- ^ whether element is a seperator -> [a] -- ^ list to split -> [[a]]
P.S. inspecting more than one element looks like an over-generalization to me and should be left to parsers or regexp libs.
It's more generally useful if you don't drop the separators from the output:
splitSepWith f = map (dropWhile f) . splitWith f spaces = splitSepWith Char.isSpace
But this still won't let you split on comma and spaces. Either regexes, or pass in a [tok] -> ([conumed], [rest]) type parser:
splitWith :: ([a] -> ([a], [a])) -> [a] -> [[a]]
... but why not make it take parsec parsers and put it in a parsec util module or something (if it isn't already there!):
splitWith (Parsec.char ',' >> Parsec.spaces)
... of course, then you might ask why not use Parsec.sepBy :) but maybe the "split on elt" concept is easier to learn than "write a whole parser".
I guess the problem with the splitWith thing is that it's a slippery path that leads right up to full-on parsers.
Exactly, and this is why we didn't reach a concensus last time. Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List? Cheers, Simon

simonmarhaskell:
I guess the problem with the splitWith thing is that it's a slippery path that leads right up to full-on parsers.
Exactly, and this is why we didn't reach a concensus last time.
Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List?
No parsers! I vote for this, currently implemented in Data.ByteString: -- | split on characters split :: Char -> String -> [String] -- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String] and -- | split on a string tokens :: String -> String -> [String] Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"] I argue the second form is what people usually want. -- Don

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Donald Bruce Stewart
I vote for this, currently implemented in Data.ByteString:
-- | split on characters split :: Char -> String -> [String]
-- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String]
and -- | split on a string tokens :: String -> String -> [String]
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Based on...? Does tokens preserve empty fields, like the proposed first form of splitBy? There is a fairly strong case for the first form when looking at CSV file parsers e.g. splitBy (== ',') "a,b,,d" == ["a", "b", "", "d"] In this case you want to correctly notice that a given field is empty. As Christian Maeder notes, it's pretty easy to filter them out afterwards, if that's what you want. OTOH, if tokens does this, then I'm not concerned that splitBy doesn't. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Hello Donald, Wednesday, July 12, 2006, 12:55:50 PM, you wrote:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
1) at least for 'lines' people prefer first 2) second can be easily got from the first -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Donald Bruce Stewart schrieb:
No parsers!
I agree,
I vote for this, currently implemented in Data.ByteString:
from which package comes Data.ByteString?
-- | split on characters split :: Char -> String -> [String]
the type for lists should then be: split :: Eq a => a -- ^ seperator -> [a] -- ^ list to split -> [[a]] (as I proposed earlier as "splitOn")
-- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String]
According to Data.PackedString (with splitPS and splitWithPS) the name should be "splitWith" for consistency (or splitWithPS should be renamed as well).
-- | split on a string tokens :: String -> String -> [String]
I don't think, that we need this function for lists.
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version! Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""] One more question is whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["","","bb","c"] This second form corresponds to splitPS but the first more general form may be desirable as well. Christian

Christian Maeder wrote:
Donald Bruce Stewart schrieb:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version!
Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""]
Prelude.lines and Prelude.unlines treat '\n' as a terminator instead of a separator. I'd argue that this is poor design, since information is lost ie lines . unlines === id whereas unlines . lines =/= id whereas if '\n' had been properly conceived of as a separator, the identity would hold. So I vote for the first option ie: splitBy (=='a') "aabbaca" == ["","","bb","c",""] Regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

On 2006-07-12 at 23:24BST "Brian Hulley" wrote:
Christian Maeder wrote:
Donald Bruce Stewart schrieb:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version!
Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""]
Prelude.lines and Prelude.unlines treat '\n' as a terminator instead of a separator. I'd argue that this is poor design, since information is lost ie lines . unlines === id whereas unlines . lines =/= id whereas if '\n' had been properly conceived of as a separator, the identity would hold.
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?" but now you've saved me the effort. I've been bitten by unlines . lines /= id already; it's something we could gainfully change without wrecking too much code, methinks.
So I vote for the first option ie:
splitBy (=='a') "aabbaca" == ["","","bb","c",""]
Seconded. As far as naming is concerned, since this is a declarative language, surely we shouldn't be using active verbs like this? (OK I lost that argument way back in the mists of Haskell 0.0 with take. Before then I called "take" "first": "first n some_list" reads perfectly well). Jón -- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

On 2006-07-13 at 10:16BST I wrote:
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?"
Having thought about this for a bit, I've come up with the below. This is intended to give the general idea -- it's not polished code, and I'm not at all wedded to the names I've used, and it almost certainly should be split up.
module Parts (parts, fromParts, contiguousParts, segmentsSatisfying) where import List (groupBy)
parts p = map hack . groupBy sameSide . map (predicateToEither p) fromParts = concat . map fromEither
Now we should have fromParts . parts p ⊑ (id:: [a]->[a]) In particular, it should be possible to apply a function to all the Right parts, and then reconstruct the list with the Left parts left alone. for example fromParts . mapRight uc_first . parts Char.isAlpha $ "A random list of words\non lines" where uc_first [] = [] uc_first (a:as) = Char.toUpper a:as => "A Random List Of Words\nOn Lines"
contiguousParts p l = [a | Right a <- parts p l]
so words = contiguousParts Char.isAlphaNum
segmentsSatisfying predicate = concat . map dropSeps . parts predicate where dropSeps e = case e of Left x -> map (const []) $ tail x Right r -> [r]
So lines = segmentsSatisfying (/= '\n') ... but the tail in the definition of segmentsSatisfying makes me uneasy. needing the function `hack` suggests that the definition of parts is written badly
hack (Left x:rest) = Left (x: map (\(Left x) -> x) rest) hack (Right x:rest) = Right (x: map (\(Right x) -> x) rest)
what follows must surely exist in a library somewhere? I'd expect it to be called Either...
predicateToEither :: (a -> Bool) -> a -> Either a a predicateToEither p x = if p x then Right x else Left x
sameSide (Left _) (Left _) = True sameSide (Right _) (Right _) = True sameSide _ _ = False
fromEither (Left x) = x fromEither (Right x) = x
liftE f1 f2 = either (Left . f1) (Right . f2)
mapRight f = map (onRight f)
onRight f = liftE id f
mapLeft f = map (onLeft f) onLeft f = liftE f id
we could do some of half of those using this:
instance Functor (Either a) where fmap f (Right a) = Right (f a) fmap f (Left l) = Left l
Is a Monad instance any use?
instance Monad (Either a) where Right a >>= f = f a Left l >>= f = Left l return = Right
-- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

On 2006-07-20 at 18:31BST I wrote:
On 2006-07-13 at 10:16BST I wrote:
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?"
Having thought about this for a bit, I've come up with the below. This is intended to give the general idea -- it's not polished code,
And already some changes (but all to comments) I'd appreciate comments on the law below. --- lib/split.lhs 2006/07/21 08:54:28 1.1 +++ lib/split.lhs 2006/07/21 09:42:49 @@ -17,10 +17,17 @@ => "A Random List Of Words\nOn Lines" + I think the relevant law is + (forall x. all (not . p) (map f x)) => + segmentsSatisfying (not . p) . fromParts . mapRight f . parts p + == segmentsSatisfying (not . p) + + In other words, if f doesn't add any "not . p" elements, + the segments satisfying "not . p" are unchanged.
contiguousParts p l = [a | Right a <- parts p l]
- so words = contiguousParts Char.isAlphaNum + so words = contiguousParts (not . Char.isSpace)
segmentsSatisfying predicate = concat . map dropSeps . parts predicate
-- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

On Wed, 12 Jul 2006, Donald Bruce Stewart wrote:
I vote for this, currently implemented in Data.ByteString:
-- | split on characters split :: Char -> String -> [String]
-- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String]
and -- | split on a string tokens :: String -> String -> [String]
OED on "token": 3b [Computing] The smallest meaningful unit of information in sequence of data for a compiler. I think that's more or less what it means to me, too. It may be possible to come up with a name that is more likely to suggest what it does and less likely to collide with identifiers used elsewhere. Maybe "splits", but anyway ideally including "split". Of course technically we seem to be talking about lists, but this last one is surely mostly about strings.
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
People will want both. The second form can be computed from the first, because it discards information about the input string, but for the same reason of course the first can't be derived from the second. (I'm not the first to say that, but since mail to this list has been arriving out of order, here it is again.) The convention I know, possibly coming from the world of UNIX shell tools, the default white space split is type 2, but split on any other string is type 1. UNIX shell does that, awk, Python ... (Perl is awk gone horribly wrong, so it presumably does but if it doesn't, it's the exception that proves the rule.) It has worked for a lot of people who do a lot of splitting, for a lot of years. Donn Cave, donn@drizzle.com
participants (10)
-
Bayley, Alistair
-
Brian Hulley
-
Bulat Ziganshin
-
Christian Maeder
-
Donn Cave
-
dons@cse.unsw.edu.au
-
Evan Laforge
-
John Meacham
-
Jon Fairbairn
-
Simon Marlow