Adding split/split' to Data.List, and redefining words/lines with it; also, adding replace/replaceBy

Hi everyone. So recently while doing some shell scripting, I found myself redefining a 'split' function (take an item and a list, and make list of lists everywhere the item appears) *yet again*, and I got annoyed enough to resolve to fix the situation. While I was at it, I decided that since 'lines' and 'words' are conceptually specializations of a general split function, I would come up with rewrites for them too; much more aesthetically satisfying to me, as it's clearer in the code now that lines and words are essentially a specialization of split, but with pragmatic edge cases (which mean we don't get nice identities like 'unlines . lines == id', but makes them more useful with, say, getContents). Code:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > lines' :: String -> [String] > lines' s = removeTrailingNull (split' '\n' s) > where > removeTrailingNull :: [String] -> [String] > removeTrailingNull y = case y of > [] -> [] > [""] -> [] > (x:xs) -> x : removeTrailingNull xs > linesProp :: String -> Bool > linesProp x = (Prelude.lines x == lines' x) > words' :: String -> [String] > words' = filter (not . and . map isSpace) . split isSpace > wordsProp :: String -> Bool > wordsProp x = (Prelude.words x == words' x) > split :: (a -> Bool) -> [a] -> [[a]] > split _ [] = [] > split p s = let (l,s') = break p s in l : case s' of > [] -> [] > (r:s'') -> [r] : split p s'' > splitUndoProp, splitUndoIdemProp, splitPreserveDelimsProp :: (Eq a) => a -> [a] > > -> Bool > splitUndoProp x y = (concat $ split (==x) y) == y > splitUndoIdemProp x y = (concat $ concat $ split (==[x]) $ split (==x) y) == y > splitPreserveDelimsProp x y = (length $ elemIndices [x] $ split (==x) y) == > > (length $ elemIndices x y) > split' :: (Eq a) => a -> [a] -> [[a]] > split' a b = filter (/= [a]) $ split (\x -> x==a) b > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I've run many many QuickChecks testing against the Prelude lines and words, and the definitions seem to be correct. What do people think of adding these? I know I'm not the only one who has wanted split or split' on more than one occasion, and they are not the funnest functions to rewrite every time you want them. (About all they're missing are Haddocks; and perhaps a better name for split' which reflects how it is lossy and can't be undone while split can be.) ------ On a secondary note, but less important than the foregoing, I'd like to add two functions: 'replace' and 'replaceBy'. They do basically what they sound like: given two items, change every occurrence in a given list of one item to another. These are two other functions I often have to redefine, which still surprises me - Data.List has a surfeit of obscure functions I've never used and which are kind of odd, but a basic search-and-replace function isn't there? I mean, I'm not saying let's add enough functions to Data.List to turn it into a mini-Perl, but it strikes me as a real gap. (As before, I've defined some sensible QC properties and checked, although the definitions look obviously right to me.) Code:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > replaceBy :: (a -> Bool) -> a -> [a] -> [a] > replaceBy a b = map (\x -> if a x then b else x) > replace :: (Eq a) => a -> a -> [a] -> [a] > replace a = replaceBy (==a) > replaceLengthProp :: (Eq a) => a -> a -> [a] -> Bool > replaceLengthProp x y z = (length $ replace x y z) == (length z) > replaceUndoableProp :: (Eq a) => a -> a -> [a] -> Bool > replaceUndoableProp x y z = if not (y `elem` z) then z == (replace y x $ replace > > x y z) else True > replaceIdempotentProp :: (Eq a) => a -> a -> [a] -> Bool > replaceIdempotentProp x y z = (replace x y $ replace x y z) == (replace x y z) > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-- gwern PRF fritz Lon News IG Keyhole advise VFCT SITOR MDA

On Thu, Jul 10, 2008 at 10:55:23AM -0400, Gwern Branwen wrote:
words' :: String -> [String] words' = filter (not . and . map isSpace) . split isSpace
Note that (and . map f) == all f, which makes this code a bit simpler and clearer. David

On 2008.07.10 08:59:20 -0700, David Roundy
On Thu, Jul 10, 2008 at 10:55:23AM -0400, Gwern Branwen wrote:
words' :: String -> [String] words' = filter (not . and . map isSpace) . split isSpace
Note that (and . map f) == all f, which makes this code a bit simpler and clearer.
David
Good point. I'll make that change. -- gwern Consulting radint MD4 Security cocaine advise Uzi Chan RRF Firewalls

Hi
What do people think of adding these?
split is sorely lacking, and definately needs to be included. However, my version is different to yours: split :: Eq a => a -> [a] -> [[a]] split x [] = [] split x xs = if null b then [a] else a : split x (tail b) where (a,b) = break (== x) xs split '*' "hello*neil" = ["hello","neil"] While with yours: split '*' "hello*neil" = ["hello","*","neil"] I much prefer mine. Didn't the bytestring people add it, under some gise, to their library? It should be consistent with that.
and perhaps a better name for split'
A better name is essential. split' should be for the strict version of split, not something quite different.
On a secondary note, but less important than the foregoing, I'd like to add two functions: 'replace' and 'replaceBy'. They do basically what they sound like: given two items, change every occurrence in a given list of one item to another.
I commonly define: rep :: Eq a => a -> a -> a rep from to x = if x == from then to else x Now you can do replace with map rep. Still, replace and replaceBy might be useful to have. Thanks Neil

On Fri, Jul 11, 2008 at 12:11:15AM +0100, Neil Mitchell wrote:
Hi
What do people think of adding these?
split is sorely lacking, and definately needs to be included. However, my version is different to yours:
I would suggest that when everyone defines a different version of a function, then it isn't actually such a good function to put into the standard libraries. David

Hi, Neil's version of split is has the same kind of behavior than Java's, Ruby's, Python's where the separator is not included in the splitted list (although there are other differences between those versions). I guess that should be possible to include a commonly used version of split in the standard libraries then, even if it is the "Haskell version" of split. Eric. -- View this message in context: http://www.nabble.com/Adding-split-split%27-to-Data.List%2C-and-redefining-w... Sent from the Haskell - Libraries mailing list archive at Nabble.com.

On Thu, 2008-07-10 at 19:40 -0700, Eric Torreborre wrote:
Hi,
Neil's version of split is has the same kind of behavior than Java's, Ruby's, Python's
Add perl to that list, as well.
where the separator is not included in the splitted list (although there are other differences between those versions).
I guess that should be possible to include a commonly used version of split in the standard libraries then, even if it is the "Haskell version" of split.
Agreed. My first reaction was to wonder why split wasn't already in MissingPy. jcc

On 2008 Jul 10, at 22:49, Jonathan Cast wrote:
On Thu, 2008-07-10 at 19:40 -0700, Eric Torreborre wrote:
Hi,
Neil's version of split is has the same kind of behavior than Java's, Ruby's, Python's
Add perl to that list, as well.
Perl's is configurable: if the regexp includes a capture, the captured part of each separator will be included in the returned list. Thus split(/(,)/, @foo) will capture the commas between the list items. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Thu, 2008-07-10 at 22:59 -0400, Brandon S. Allbery KF8NH wrote:
On 2008 Jul 10, at 22:49, Jonathan Cast wrote:
On Thu, 2008-07-10 at 19:40 -0700, Eric Torreborre wrote:
Hi,
Neil's version of split is has the same kind of behavior than Java's, Ruby's, Python's
Add perl to that list, as well.
Perl's is configurable: if the regexp includes a capture, the captured part of each separator will be included in the returned list. Thus split(/(,)/, @foo) will capture the commas between the list items.
I got caught out in a detail of perl's semantics. This is extremely embarrassing. In my defense, I can only (weakly) point out that both versions proposed thus far have type String -> String -> [String] (or a generalization thereof) and that perl's behavior when passed a regular string (i.e., a return value of quotemeta) does match Neil's. jcc

On 2008-07-10 19:40 -0700 (Thu), Eric Torreborre wrote:
Neil's version of split is has the same kind of behavior than Java's, Ruby's, Python's where the separator is not included in the splitted list....
Actually, I seem to recall there was something in Ruby somewhere that did a split-type thing, but gave you the option of including the delimiter or not. But it's not in String#split, anyway. Ruby also has special behaviour for including or not including empty fields at the beginning or end. E.g.: "1,2,,3,4,,".split(',', 4) #=> ["1", "2", "", "3,4,,"] "1,2,,3,4,,".split(',', -4) #=> ["1", "2", "", "3", "4", "", ""]
I guess that should be possible to include a commonly used version of split in the standard libraries then, even if it is the "Haskell version" of split.
Or perhaps we could find some way of specifying how the split is
supposed to work.
cjs
--
Curt Sampson

On Thu, 10 Jul 2008, David Roundy wrote:
On Fri, Jul 11, 2008 at 12:11:15AM +0100, Neil Mitchell wrote:
Hi
What do people think of adding these?
split is sorely lacking, and definately needs to be included. However, my version is different to yours:
I would suggest that when everyone defines a different version of a function, then it isn't actually such a good function to put into the standard libraries.
I think the problem is, that there is no commonly agreed data structure for a list of alternating elements, like this one: http://darcs.haskell.org/event-list/src/Data/AlternatingList/List/Uniform.hs With this intermediate form, you could have a generic 'split' function split :: (a -> Bool) -> [a] -> AlternatingList a [a] from which you can derive several flavours easily, by extracting only the [a] elements, or by turning 'a' into [a] and flatten the alternating list then, or by fusing pairs of 'a' and [a].

On Thu, Jul 10, 2008 at 9:18 PM, David Roundy
On Fri, Jul 11, 2008 at 12:11:15AM +0100, Neil Mitchell wrote:
Hi
What do people think of adding these?
split is sorely lacking, and definately needs to be included. However, my version is different to yours:
I would suggest that when everyone defines a different version of a function, then it isn't actually such a good function to put into the standard libraries.
David
Personally, I disagree. The way I see, the proliferation of splits is a cry for help, pointing to a need for some sort of centralization and cleanup. As it is, there are something like 5 or 6 splits floating around out there (split, split´/splitNeil, splitTwanvl, splitBytestring, splitHSH, and a few others I have not really looked at*). Each one is defined a little differently, and looks different - even if they turn out to be the exact same thing like with split´/splitNeil; so each time a Haskeller needs this they have to roll their own or use someone else´s roll, and of course each one has subtly different edge cases, so there´s no guarantee you actually understand a given split implementation. I would regard having a few well-documented, tested splits in Data.List as an enormous improvement on the status quo - and I would still regard it as so even if it turned out that one needs to add, say, 4 splits. The situation makes me a little queasy. Tons of people like dons or Simon Marlowe or so on agree that a basic list processing function is needed, and we´ve been agreed on that point for two years, but somehow nothing improves, and I go on rewriting split in scripts every so often. * http://www.haskell.org/pipermail/haskell-cafe/2006-July/016574.html -- gwern

On 2008.07.11 00:11:15 +0100, Neil Mitchell
Hi
What do people think of adding these?
split is sorely lacking, and definately needs to be included. However, my version is different to yours:
split :: Eq a => a -> [a] -> [[a]] split x [] = [] split x xs = if null b then [a] else a : split x (tail b) where (a,b) = break (== x) xs
split '*' "hello*neil" = ["hello","neil"]
While with yours:
split '*' "hello*neil" = ["hello","*","neil"]
I much prefer mine.
Well, your version of split is entirely respectable. And according to QuickCheck, identical to my split' (I knew sending this that there would be dog-shed issues, and I was hoping to avoid them): splitNeil :: Eq a => a -> [a] -> [[a]] splitNeil x [] = [] splitNeil x xs = if null b then [a] else a : splitNeil x (tail b) where (a,b) = break (== x) xs splitNeilProp x y = splitNeil x y == split' x y *Foo> quickCheck splitNeilProp +++ OK, passed 100 tests. (Still, I think I'll hold onto this definition. It isn't obviously working the same way to my eyes, and could provide a useful sanity check for split'.)
Didn't the bytestring people add it, under some gise, to their library? It should be consistent with that.
The bytestring split is apparently different from anything discussed here. That is, http://hackage.haskell.org/packages/archive/bytestring/0.9.1.0/doc/html/Data... says that: split 'a' "aXaXaXa" == ["","X","X","X",""] while my split: split (=='a') "aXaXaXa" == ["","a","X","a","X","a","X","a"] and split'/neilSplit: split' 'a' "aXaXaXa" == ["","X","X","X"] I'm not sure it's all that important though, as ByteString.split and my split both are invertible (intercalate [c] . ByteStringsplit c == id; (concat $ split (==x) y) == y) and split' isn't. --- More generally, I feel the library should have both. Even though split' is simple in terms of split, this discussion shows that people want to consume delimiters at times, and having both makes it possible to define lines and words both using them.
and perhaps a better name for split'
A better name is essential. split' should be for the strict version of split, not something quite different.
Yes, that's true. I wasn't thinking of strictness but of the prime notation - ie. here's another, specialized version of split. So what would you suggest? splitConsume? splitLossy? splitAndShrink?
On a secondary note, but less important than the foregoing, I'd like to add two functions: 'replace' and 'replaceBy'. They do basically what they sound like: given two items, change every occurrence in a given list of one item to another.
I commonly define:
rep :: Eq a => a -> a -> a rep from to x = if x == from then to else x
Now you can do replace with map rep.
Still, replace and replaceBy might be useful to have.
Thanks
Neil
-- gwern cryptogon Playboy Duress UXO Veiligheidsdienst B43 screws Poseidon AST BCCI

Just wanted to chime in with some historical perspective (no doubt with a few gaping holes, so feel free to correct me). There's a brief summary on the wiki of the prior discussions: http://www.haskell.org/haskellwiki/List_function_suggestions These are the two previous threads: http://www.haskell.org/pipermail/haskell-cafe/2006-July/thread.html#16559 http://www.haskell.org/pipermail/libraries/2004-July/thread.html#2342 Some interesting points: - the lines/words analogy to split/split': http://www.haskell.org/pipermail/libraries/2004-July/002352.html Which reinforces Gwern's point that the library should have both. - Perl, Python, and Ruby do not agree on split semantics: http://www.haskell.org/pipermail/libraries/2004-July/002351.html - there seems to be some desire for functions splitBy and tokens, which haven't been mentioned (yet) in this thread, I think. And finally my opinion: ByteString split:
split 'a' "aXaXaXa" == ["","X","X","X",""]
Neil's:
split' 'a' "aXaXaXa" == ["","X","X","X"]
I prefer the ByteString split (i.e. I prefer separator semantics over terminator), also because it's invertible. Some more support: http://www.haskell.org/pipermail/haskell-cafe/2006-July/016649.html Alistair

Hi
ByteString split:
split 'a' "aXaXaXa" == ["","X","X","X",""]
Neil's:
split' 'a' "aXaXaXa" == ["","X","X","X"]
I prefer the ByteString split (i.e. I prefer separator semantics over terminator), also because it's invertible. Some more support: http://www.haskell.org/pipermail/haskell-cafe/2006-July/016649.html
I prefer the ByteString split as well! That should be the one we use. Thanks Neil

On Fri, 11 Jul 2008, Neil Mitchell wrote:
Hi
ByteString split:
split 'a' "aXaXaXa" == ["","X","X","X",""]
Neil's:
split' 'a' "aXaXaXa" == ["","X","X","X"]
I prefer the ByteString split (i.e. I prefer separator semantics over terminator), also because it's invertible. Some more support: http://www.haskell.org/pipermail/haskell-cafe/2006-July/016649.html
I prefer the ByteString split as well! That should be the one we use.
A minor issue is, that 'split' does always return a non-empty list, which is not expressed by the list type of the result, but it would be nicely expressed by the alternating list type. In some cases it is necessary to handle the first or the last element differently, which requires that there is a first or a last element.

The big problem I see with split is the naming, which is already confused to a certain extent. It seems to me sensible that just as inits is the list of successive applications of init, and tails of tails, splits should be the list of successive applications of split. The analogy with inits and tails goes further, though: splits should be the list of all possible ways to generate contiguous bipartitions of a given list. Ironically, I was looking for just this function earlier today, while thinking about generating permutations. These ideas together suggest that split :: ([a], [a]) -> ([a], [a]) split (_, []) = error "split of empty list" split (l, r : rs) = (l ++ [r], rs) splits :: [a] -> [([a], [a])] splits l = splits' ([], l) where splits' l@(_, []) = [l] splits' l = l : splits' (split l) (I can rewrite splits using unfoldr, but to my mind it looks even worse. Is there some way to get the reflexive transitive closure of a function defined over a Maybe range, using fix or somesuch, that I'm missing?) We thus have, for example split [("1", "23")] == [("12", "3")] splits "123" == [("","123"),("1","23"),("12","3"),("123","")] This also dovetails nicely with what splitAt does, via the identity splitAt i l == splits l !! i (for i in the domain of !!). The function everyone has been asking for, it seems to me, is actually the right inverse of intercalate. This suggests that it might be named "deintercalate", which would be awesome assuming we hate each other and ourselves. "unintercalate" is even worse, since "intercalate" is the "unwords" analog and "unintercalate" would be the "words" analog. (How did that happen, anyway?) Microsoft Encarta Thesaurus suggests (http://au.encarta.msn.com/thesaurus_1861859438/intercalate.html) "extrapolate" as an antonym for "intercalate", but that doesn't even seem correct, much less sensible in this context. Perhaps "striate" would be acceptable? Note that the right thing sort of happens here: we have (for the right definition of striate) that intercalate sep . striate sep is the identity on lists, which seems like a good thing. I'd suggest that a reasonable definition of striate might be striate :: (Eq a) => [a] -> [a] -> [[a]] striate _ [] = [[]] striate sep l | isPrefixOf sep l = [] : striate sep (drop (length sep) l) striate sep (e : es) = (e : l') : ls' where (l' : ls') = striate sep es which seems to me pretty natural except for that odd definition on the empty list, but I think that's arguably a feature. (Probably there's some clever fold or something I'm missing here. Oh well.) Once one has all the split machinery for contiguous bipartions represented as two-tuples of lists, of course, one might naturally start to think about generalizing the whole mess to contiguous k-partitions or just contiguous partitions, represented as lists of lists. But I think I'm done for now. I'm sure everyone is relieved. Bart Massey bart@cs.pdx.edu

Bart Massey
I'd suggest that a reasonable definition of striate might be
striate :: (Eq a) => [a] -> [a] -> [[a]] striate _ [] = [[]] striate sep l | isPrefixOf sep l = [] : striate sep (drop (length sep) l) striate sep (e : es) = (e : l') : ls' where (l' : ls') = striate sep es
which seems to me pretty natural except for that odd definition on the empty list, but I think that's arguably a feature. (Probably there's some clever fold or something I'm missing here. Oh well.)
I know no one's listening, but I thought I'd correct myself on a couple of minor points, for the record. [BTW, the above paragraph was originally placed above the quoted text. As a result, gmane decided I was "top posting", and wouldn't let me continue. That's really obnoxious; I won't be posting here again until that bug is fixed.] First of all, the correct geologic antonym for intercalate is stratify, not striate. Second of all, my definition fails badly with the empty separator. We can argue about whether it should do anything in this case, but it should at least throw an error instead of generating an infinite list of nils. My next shot at this: stratify :: (Eq a) => [a] -> [a] -> [[a]] stratify [] l = map (:[]) l stratify _ [] = [[]] stratify sep l | isPrefixOf sep l = [] : stratify sep (drop (length sep) l) striate sep (e : es) = (e : l') : ls' where (l' : ls') = stratify sep es The case of stratify [] [] is ugly now, though. I dunno. Bart Massey bart@cs.pdx.edu

On 2008 Jul 18, at 5:14, Bart Massey wrote:
[BTW, the above paragraph was originally placed above the quoted text. As a result, gmane decided I was "top posting", and wouldn't let me continue. That's really obnoxious; I won't be posting here again until that bug is fixed.]
Try Nabble instead? (Except I don't see libraries@haskell.org there.) -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
participants (10)
-
Alistair Bayley
-
Bart Massey
-
Brandon S. Allbery KF8NH
-
Curt Sampson
-
David Roundy
-
Eric Torreborre
-
Gwern Branwen
-
Henning Thielemann
-
Jonathan Cast
-
Neil Mitchell