Re: Why is there no splitBy in the list module?

Evan Laforge wrote:
splitBy :: (a -> Bool) -- ^ whether element is a seperator -> [a] -- ^ list to split -> [[a]]
P.S. inspecting more than one element looks like an over-generalization to me and should be left to parsers or regexp libs.
It's more generally useful if you don't drop the separators from the output:
splitSepWith f = map (dropWhile f) . splitWith f spaces = splitSepWith Char.isSpace
But this still won't let you split on comma and spaces. Either regexes, or pass in a [tok] -> ([conumed], [rest]) type parser:
splitWith :: ([a] -> ([a], [a])) -> [a] -> [[a]]
... but why not make it take parsec parsers and put it in a parsec util module or something (if it isn't already there!):
splitWith (Parsec.char ',' >> Parsec.spaces)
... of course, then you might ask why not use Parsec.sepBy :) but maybe the "split on elt" concept is easier to learn than "write a whole parser".
I guess the problem with the splitWith thing is that it's a slippery path that leads right up to full-on parsers.
Exactly, and this is why we didn't reach a concensus last time. Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List? Cheers, Simon

simonmarhaskell:
I guess the problem with the splitWith thing is that it's a slippery path that leads right up to full-on parsers.
Exactly, and this is why we didn't reach a concensus last time.
Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List?
No parsers! I vote for this, currently implemented in Data.ByteString: -- | split on characters split :: Char -> String -> [String] -- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String] and -- | split on a string tokens :: String -> String -> [String] Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"] I argue the second form is what people usually want. -- Don

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Donald Bruce Stewart
I vote for this, currently implemented in Data.ByteString:
-- | split on characters split :: Char -> String -> [String]
-- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String]
and -- | split on a string tokens :: String -> String -> [String]
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Based on...? Does tokens preserve empty fields, like the proposed first form of splitBy? There is a fairly strong case for the first form when looking at CSV file parsers e.g. splitBy (== ',') "a,b,,d" == ["a", "b", "", "d"] In this case you want to correctly notice that a given field is empty. As Christian Maeder notes, it's pretty easy to filter them out afterwards, if that's what you want. OTOH, if tokens does this, then I'm not concerned that splitBy doesn't. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Hello Donald, Wednesday, July 12, 2006, 12:55:50 PM, you wrote:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
1) at least for 'lines' people prefer first 2) second can be easily got from the first -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Donald Bruce Stewart schrieb:
No parsers!
I agree,
I vote for this, currently implemented in Data.ByteString:
from which package comes Data.ByteString?
-- | split on characters split :: Char -> String -> [String]
the type for lists should then be: split :: Eq a => a -- ^ seperator -> [a] -- ^ list to split -> [[a]] (as I proposed earlier as "splitOn")
-- | split on predicate * splitBy :: (Char -> Bool) -> String -> [String]
According to Data.PackedString (with splitPS and splitWithPS) the name should be "splitWith" for consistency (or splitWithPS should be renamed as well).
-- | split on a string tokens :: String -> String -> [String]
I don't think, that we need this function for lists.
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version! Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""] One more question is whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["","","bb","c"] This second form corresponds to splitPS but the first more general form may be desirable as well. Christian

Simon Marlow schrieb:
Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List?
Here is my proposal that is consistent with Data.PackedString and "lines" (i.e a final delimiter is ignored -- by extra code) {- | The 'splitWith' function takes a predicate and splits the input list at each element which satisfies the predicate. -} splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p s = case s of [] -> [] _ -> let (l, r) = break p s in case r of _ : t@(_ : _) -> l : splitWith p t _ -> [l] {- | The 'split' function splits the input list on each occurrence of the given element. -} split :: Eq a => a -> [a] -> [[a]] split c = splitWith (== c)

maeder:
Simon Marlow schrieb:
Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List?
Here is my proposal that is consistent with Data.PackedString and "lines" (i.e a final delimiter is ignored -- by extra code)
I was thinking of Data.ByteString, but no matter. I'ved added some QuickChecks, and suggest changing splitWith to splitBy, following the other List functions with explicit predicates, sortBy, groupBy, minimumBy, ... Looks like a nice, simple minimal change that could be done. -- | The 'splitBy' function takes a predicate and splits the input -- list at each element which satisfies the predicate. splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p s = case s of [] -> [] _ -> let (l, r) = break p s in case r of _ : t@(_:_) -> l : splitBy p t _ -> [l] -- | The 'split' function splits the input list on each occurrence of -- the given element. split :: Eq a => a -> [a] -> [[a]] split c = splitBy (== c) ------------------------------------------------------------------------ -- -- QuickChecks module Data.Split where import Data.List import Data.Char import Test.QuickCheck import Test.QuickCheck.Batch prop_lines_split xs = lines xs == split '\n' xs prop_words_split xs = words xs == split ' ' xs test_split = runTests "split" defOpt [ run prop_lines_split , run prop_words_split ] instance Arbitrary Char where arbitrary = choose (minBound, maxBound) coarbitrary c = variant (ord c `rem` 4)

Donald Bruce Stewart schrieb:
I'ved added some QuickChecks, and suggest changing splitWith to splitBy, following the other List functions with explicit predicates, sortBy, groupBy, minimumBy, ...
fine (if also splitWithPS is changed to splitByPS) The other versions (not ignoring a final delimiter and deleting all empty lists) are desirable, too. Are there any suggestions for canonical names? The implementations could be hidden (and more efficient) then (by avoiding to traverse a list twice). How about this? splitFields, splitWords, splitLines (= split ?) splitFieldsBy, splitWordsBy, splitLinesBy (= splitBy ?) splitFields 'a' "baaba" -> ["b", "", "b", ""] splitLines 'a' "baaba" -> ["b", "", "b"] splitWords 'a' "baaba" -> ["b", "b"] Cheers Christian

Christian Maeder
The other versions (not ignoring a final delimiter and deleting all empty lists) are desirable, too. Are there any suggestions for canonical names? The implementations could be hidden (and more efficient) then (by avoiding to traverse a list twice).
Surely list deforestation should take care of the efficiency question?
splitFields 'a' "baaba" -> ["b", "", "b", ""] splitLines 'a' "baaba" -> ["b", "", "b"] splitWords 'a' "baaba" -> ["b", "b"]
Did you mean this? splitFields 'a' "baaba" -> ["b", "", "", "b", ""] splitLines 'a' "baaba" -> ["b", "", "", "b"] splitWords 'a' "baaba" -> ["b", "b"] Regards, Malcolm

Malcolm Wallace schrieb:
Christian Maeder
wrote: splitFields 'a' "baaba" -> ["b", "", "b", ""] splitLines 'a' "baaba" -> ["b", "", "b"] splitWords 'a' "baaba" -> ["b", "b"]
Did you mean this?
splitFields 'a' "baaba" -> ["b", "", "", "b", ""] splitLines 'a' "baaba" -> ["b", "", "", "b"] splitWords 'a' "baaba" -> ["b", "b"]
no! lines = splitLines '\n' *Main> lines "b\n\nb\n" ["b","","b"]

Donald Bruce Stewart schrieb:
maeder:
-- | The 'splitBy' function takes a predicate and splits the input -- list at each element which satisfies the predicate. splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p s = case s of [] -> [] _ -> let (l, r) = break p s in case r of _ : t@(_:_) -> l : splitBy p t _ -> [l]
-- | The 'split' function splits the input list on each occurrence of -- the given element. split :: Eq a => a -> [a] -> [[a]] split c = splitBy (== c)
prop_lines_split xs = lines xs == split '\n' xs
prop_words_split xs = words xs == split ' ' xs
The last property is wrong. It must be: prop_words_split = words xs == filter (not . null) (split ' ' xs) Another property of my suggested functions: splitWords, splitLines, splitFields :: Eq a => a -> [a] -> [[a]] id = concat . intersperse [c] . splitFields c -- :: [a] -> [a] splitLines is equal to splitFields as long as there is no last seperator (or terminator) element. splitWords can be easily derived from splitLines or splitFields via "filter (not . null)". splitFields c = splitLines c . (++ [c]) Which of the two versions splitFields or splitLines should get the general name "split" is a matter of taste and of compatibility to the currently incompatible Data.PackedString or (your) Data.ByteString. If we cannot agree, I'ld suggest to keep (my) longer names, only. Cheers Christian

After all our discussions, I simply want to keep up my old proposal. It drops a final delimiter and is named using "With" instead of "By" and therefore would not imply any changes to Data.PackedString. Isn't that better than changing nothing (after so many people asked for a split function)? Christian Christian Maeder schrieb:
Simon Marlow schrieb:
Would someone like to make a concrete proposal (with code!) for 2-3 functions we could reasonably add to Data.List?
Here is my proposal that is consistent with Data.PackedString and "lines" (i.e a final delimiter is ignored -- by extra code)
{- | The 'splitWith' function takes a predicate and splits the input list at each element which satisfies the predicate. -} splitWith :: (a -> Bool) -> [a] -> [[a]] splitWith p s = case s of [] -> [] _ -> let (l, r) = break p s in case r of _ : t@(_ : _) -> l : splitWith p t _ -> [l]
{- | The 'split' function splits the input list on each occurrence of the given element. -} split :: Eq a => a -> [a] -> [[a]] split c = splitWith (== c)

Christian Maeder
After all our discussions, I simply want to keep up my old proposal. It drops a final delimiter and is named using "With" instead of "By" and therefore would not imply any changes to Data.PackedString.
I think the general objection to the "With" suffix is that it is non-standard naming. All the Haskell'98 libraries use "By" for this kind of meaning. Data.PackedString is non-standard, and should not be relied upon as an accurate stylistic guide. Regards, Malcolm

On Tue, 2006-07-18 at 12:27 +0100, Malcolm Wallace wrote:
Christian Maeder
wrote: After all our discussions, I simply want to keep up my old proposal. It drops a final delimiter and is named using "With" instead of "By" and therefore would not imply any changes to Data.PackedString.
I think the general objection to the "With" suffix is that it is non-standard naming. All the Haskell'98 libraries use "By" for this kind of meaning. Data.PackedString is non-standard, and should not be relied upon as an accurate stylistic guide.
Indeed, the Data.PackedString API aspires to follow existing practise from Data.List and elsewhere in the standard libs. So if there is a discrepancy then it is Data.PackedString that should change. We'd be happy to do that, especially as the API is not yet set in stone. Duncan

Malcolm Wallace schrieb:
I think the general objection to the "With" suffix is that it is non-standard naming. All the Haskell'98 libraries use "By" for this kind of meaning.
Looking at Data.List there is not a single function with suffix "By" that takes a predicate argument (filter, dropWhile, etc.). All "...By" function (nubBy, ..., minimumBy) take binary comparison arguments.
Data.PackedString is non-standard, and should not be relied upon as an accurate stylistic guide.
I consider it to be standard (currently) since it is in package base. However, I've no problem if "With" is replaces with "By" in my proposal (and also in Data.PackedString) Christian

Christian Maeder wrote:
Malcolm Wallace schrieb:
I think the general objection to the "With" suffix is that it is non-standard naming. All the Haskell'98 libraries use "By" for this kind of meaning.
Looking at Data.List there is not a single function with suffix "By" that takes a predicate argument (filter, dropWhile, etc.).
All "...By" function (nubBy, ..., minimumBy) take binary comparison arguments.
Hmm... splitWhen ? However, by analogy with 'break' and 'span' it should be just 'split', and if you want to split on white space, you say so: 'split isSpace'. I also think, this version of 'split' (or whatever you want to call it) is all that's needed, because *> lines == split (== '\n') *> words == filter (not.null) . split isSpace and anyone who needs more than that is better served with alex, some sort of regexes or Manuel Chakravarty's lexer combinators. Udo.

Udo Stenzel schrieb:
However, by analogy with 'break' and 'span' it should be just 'split', and if you want to split on white space, you say so: 'split isSpace'.
I also think, this version of 'split' (or whatever you want to call it) is all that's needed, because
*> lines == split (== '\n') *> words == filter (not.null) . split isSpace
good idea. I support this, too. Christian
participants (8)
-
Bayley, Alistair
-
Bulat Ziganshin
-
Christian Maeder
-
dons@cse.unsw.edu.au
-
Duncan Coutts
-
Malcolm Wallace
-
Simon Marlow
-
Udo Stenzel