Proposal for Data.List.splitBy

We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it. For the record, see the threads * http://www.haskell.org/pipermail/libraries/2004-July/thread.html#2342 * http://www.haskell.org/pipermail/haskell-cafe/2006-July/thread.html#16559 * http://www.haskell.org/pipermail/libraries/2008-January/thread.html#8922 see the article * http://www.haskell.org/pipermail/libraries/2006-October/006072.html and see the wiki page * http://haskell.org/haskellwiki/List_function_suggestions Essentially, I believe that the difficulty in choosing a splitter for Data.List comes from the fact that it does not take much of a problem to solve before a simple splitter is just not enough whereupon you should just use a FSM, Alex, or Parsec for example. A way forward would be to state a few meta-properties and properties up front and then see if this leads to a reasonable solution for the simple problems that just need a solution that is "good enough." I would suggest that 1. a set of splitter words should be small in number, 2. such a set should fit within the spirit of the List module of the Haskell 98 standard libraries, 3. it should pick up where break and span left off, and 4. the results coming from this set of splitter words should be unsplittable. I would argue that the above suggestions imply at least two words * splitBy :: (a -> Bool) -> [a] -> [([a], [a])] and * split :: [a] -> [a] -> [([a], [a])] such that splitBy takes a predicate p, e.g., (=='\n'), and a list xs and such that split takes a list of equivalent delimiters ds, e.g., " \t", and a list xs. The functions splitBy and split should at least have the follow properites.
(concatMap (\(x1, x2) -> x1 ++ x2) $ splitBy p xs) == xs splitBy _ [] == [([],[])] splitBy (\x -> False) xs == [(xs,[])] splitBy (\x -> True) xs == [([],[x1]),([],[x2]),([],[x3]), ...]
and
(concatMap (\(x1, x2) -> x1 ++ x2) $ split ds xs) == xs split _ [] == [([],[])] split [] xs == [(xs,[])] split xs xs == [([],[x1]),([],[x2]), ([],[x3]), ...]
where xs == x1:x2:x2:...:[]. Examples would be
splitBy (=='/') "/a/b" == [("","/"),("a","/"),("b","")] splitBy (=='/') "//aa//bb" == [("","/"),("","/"),("aa","/"),("","/"),("bb","")] splitBy (=='\n') "\na\nb" == [("","\n"),("a","\n"),("b","")] split xs xs == [([],[x1]),([],[x2]), ([],[x3]), ...] split " \t" "a\tb \tc\t d " == [("a","\t"),("b"," "),("","\t"),("c","\t"),(""," "),("d"," ")] split "/" "/a/b" == [("","/"),("a","/"),("b","")]
This leaves the one remaining "simple" case which is splitting using a list such as "\r\n" which splitBy and split cannot handle easily. This leads to * splitUsing :: (Eq a) => [a] -> [a] -> [([a], [a])] such that
(concatMap (\(x1, x2) -> x1 ++ x2) $ splitUsing xs xs') == xs' splitUsing _ [] == [([] , [])] splitUsing [] xs == [(xs , [])] splitUsing xs xs == [([] , xs)]
Examples would be
splitUsing "\r\n" "a" == [("a", "")] splitUsing "\r\n" "\r\n" == [("" , "\r\n")] splitUsing "\r\n" "a\r\n" == [("a", "\r\n")] splitUsing "\r\n" "a\r\n\r\n" == [("a", "\r\n"), ("" , "\r\n")] splitUsing "\r\n" "a\r\nb" == [("a", "\r\n"), ("b", "")] splitUsing "\r\n" "a\r\nb\r\n" == [("a", "\r\n"), ("b", "\r\n")]
Wrapping up, based on a sort of logical extension, you could imagine * splitWhere :: ([a] -> Bool) -> [a] -> [([a], [a])] but I believe that this fourth word is one too many, cute, but not necessary. If the above three splitter words are not enough, then I would say that you need a more powerful tool beyond the spirit of the Haskell 98 standard List module. Just to check, the Haskell 98 List.lines function would be defined as
lines98 xs = map fst $ split "\n" xs
and proper lines' and unlines' functions would be
lines' :: String -> Either [String] [String] lines' [] = Left [] lines' xs = let res = split "\n" xs nul = (null.snd.last) res in (if nul then Left else Right) $ map fst res
unlines' :: Either [String] [String] -> String unlines' exss = intercalate "\n" $ either (id) (++[[]]) exss
such that
unlines' . lines' == id
This is to say, (Right [String]) if the original xs ends in '\n' and (Left [String]) if it does not. For your review, the attached file Split.hs defines and documents splitBy, split, splitUsing, and splitWhere, and the attached file SplitProperties.hs defines tests for these same words. Finally, I would propose adding splitBy, split, and splitUsing to Data.List as three words that fit within the spirit of the Haskell 98 List module, take off where break and span leave off, are unsplittable, and are "good enough". Cheers, - Marcus -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split

Henning Thielemann wrote:
On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split
I did not have that impression. I saw, read, and played with this package, but some how I just thought that it was a shame that in 98 one did not continue just a little bit farther beyond span and break in the the List module and make splitBy and split, at least. These two with something to handle "\r\n" would do much for the simple cases. After this, give up and use a bigger hammer such as Alex, Parsec, etc. Maybe the correct stopping point is break and span whereupon you find a package at the correct level for what you need. But my feeling is that if splitBy and split had been in all along, they would have been useful, and the next step would be other modules and packages as we have. -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

On 2009 Jan 4, at 17:59, Marcus D. Gabriel wrote:
Henning Thielemann wrote:
On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split
I did not have that impression.
The discussion demonstrated why there is no standard split in Data.List: nobody can agree on how it should work. Hence the Split package, which should allow us to find out which ones get used and which don't, and which are viable candidates for addition to Data.List. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
On 2009 Jan 4, at 17:59, Marcus D. Gabriel wrote:
Henning Thielemann wrote:
On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split
I did not have that impression.
The discussion demonstrated why there is no standard split in Data.List: nobody can agree on how it should work. Hence the Split package, which should allow us to find out which ones get used and which don't, and which are viable candidates for addition to Data.List.
I did not really interpret the Split package this way, but thank you for pointing this out to me. Let me follow up with Duncan's response and see if I can explain myself better. -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

On Sun, 2009-01-04 at 23:25 +0100, Henning Thielemann wrote:
On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split
We should say that the split package is a way of continuing the discussion. The hope of some around here is that we can find a consensus on a small set of common split functions and add them to the Data.List module. I'm not sure if that goal is achievable but I think it's worth trying. Duncan

Duncan Coutts wrote:
On Sun, 2009-01-04 at 23:25 +0100, Henning Thielemann wrote:
On Sun, 4 Jan 2009, Marcus D. Gabriel wrote:
We seem to have not added any splitters such as splitBy or split to Data.List despite the discussions. Here is yet another perspective with the hope that it will make it.
I assumed that the discussion was resolved by http://hackage.haskell.org/cgi-bin/hackage-scripts/package/split
We should say that the split package is a way of continuing the discussion. The hope of some around here is that we can find a consensus on a small set of common split functions and add them to the Data.List module. I'm not sure if that goal is achievable but I think it's worth trying.
Duncan
From the e-mail threads, I would say the above are the absolute minimum
Let me see if I can explain what I was trying to bring to the discussion. Forget the code examples that I gave for now. You should realize that I am someone who was reading the discussion threads and felt that the group had the answer or answers, that there was no fighting going on as someone wrote somewhere, and that the key properties that the splitter functions should posses is contained in the e-mail threads but not grouped together. In other words, it was a nice discussion to read. So, let me formulate this from my point of view of how this comes together as I have read it. 1/ It's worth having some splitter functions in the Data.List module, so this is the goal. Otherwise, one leaves Data.List alone and the discussion should be whether or not one should simply use a Data.List.Split package to handle a set of useful idioms somewhere between Data.List and Text.ParserCombinators.Parsec in complexity. Possibly one would flush out Data.List.Split and even write it with a pedantic theme for new comers to Haskell for example. 2/ Some properties, in a loose sense of the term, are needed to constrain the choices and from the e-mail threads, I would say that this sums up as the following. P1. The number of splitter functions should be few in number, otherwise, make a package. P2. There should be no information loss, that is, keep the delimiters, keep the separators, keep the parts of the original list xs that satisfy a predicate p, do not lose information about the beginning and the end of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard. P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.) properties that the solution should have. If there is no agreement here, this would be interesting and should be discussed. I would go on to write that the e-mail threads and the properties above strongly imply in my mind the follow property: P4. A split list xs' should have the same relative order as the original list xs. (I am sure everyone finds this obvious, and in the e-mail threads I do not remember an exception, so let us state this clearly.) The next properties are not exactly what I read from the e-mail threads and are really more my personal view, but they are not in contradiction with what I read either. P5. The splitter functions should fit within the spirit of the Data.List module and even the original Haskell 98 List module in terms of type signature and complexity of implementation. P6. There should be at least one splitter function that takes a predicate p, a "by" word such as groupBy or nubBy. If there is agreement about P1 to P6, then this is a way to frame the discussion. If not, then I would say that this needs to be determined first, otherwise the goal stated above will be difficult to reach. The function intercalate was probably easy to add because it grasps a nice idiom, but with splitter functions there is the problem of adding more and more words and moving closer and closer to the power and complexity of Parsec for example. This next property is very much my point of view: P7. The splitter functions should take off where break and scan left off. I add this one because when I was first learning Haskell, I was using the List module quite a bit. I needed a splitBy word, and I thought that break and scan were for what I was looking, but they were not. So I wrote
splitBy :: (a -> Bool) -> [a] -> [([a], [a])] splitBy p xs = let (hd1, tl1) = break p xs (hd2, tl2) = span p tl1 in (hd1, hd2):(if null tl2 then [] else splitBy p tl2)
which I eventually came to not like. To sum up, if we can agree that a set of splitter properties should fit within the context of Properties P1 to P7, then maybe forward movement can be made. If we cannot agree on these properties or another set, then splitter function idioms are too idiomatic, and a larger set of words are needed, so the investment should be in Data.List.Split. Cheers, - Marcus -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

P2. There should be no information loss, that is, keep the delimiters, keep the separators, keep the parts of the original list xs that satisfy a predicate p, do not lose information about the beginning and the end of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this. The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"]. In this case it's really annoying to have to throw away the delimiters yourself, especially if you just get back a list like ["foo",";","bar",";","baz"] and have to decide which things are delimiters and which aren't, with no help from the type system. But, as you noted, throwing away information like this is bad from an elegance/formal properties point of view. This is exactly why I designed the Data.List.Split library as I did: the core internal splitting function is information-preserving, and by using various combinators the user can choose to throw away whatever information they are not interested in. -Brent

On Sun, Jan 11, 2009 at 3:56 PM, Brent Yorgey
P2. There should be no information loss, that is, keep the delimiters, keep the separators, keep the parts of the original list xs that satisfy a predicate p, do not lose information about the beginning and the end of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this. The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"]. In this case it's really annoying to have to throw away the delimiters yourself, especially if you just get back a list like ["foo",";","bar",";","baz"] and have to decide which things are delimiters and which aren't, with no help from the type system. But, as you noted, throwing away information like this is bad from an elegance/formal properties point of view. This is exactly why I designed the Data.List.Split library as I did: the core internal splitting function is information-preserving, and by using various combinators the user can choose to throw away whatever information they are not interested in.
-Brent _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
My personal opinion here is that there should be no split function in Data.List. Brent has put together a very nice splitting package; if we want this to be more accessible, we should put it in the Haskell Platform or whatever "blessing" procedure is established. Putting the split functions back into Data.List after they've already been added to a comprehensive splitting library is just going to be arbitrary no matter which ones we choose. Regards, Alex

On Mon, 2009-01-12 at 21:36 -0800, Alexander Dunlap wrote:
My personal opinion here is that there should be no split function in Data.List. Brent has put together a very nice splitting package; if we want this to be more accessible, we should put it in the Haskell Platform or whatever "blessing" procedure is established.
That is a possibility.
Putting the split functions back into Data.List after they've already been added to a comprehensive splitting library is just going to be arbitrary no matter which ones we choose.
Not necessarily arbitrary. The Data.List collection is not some canonical orthogonal basis of all list functions, it's a collection of frequently used (and mostly orthogonal, composable etc) list functions. So it may be that by looking at how split functions are used in Haskell code in general that we can identify one or two that are both reasonably general and used reasonably frequently. On the other hand we may find that there are not! :-) Duncan

Brent Yorgey wrote:
P2. There should be no information loss, that is, keep the delimiters, keep the separators, keep the parts of the original list xs that satisfy a predicate p, do not lose information about the beginning and the end of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this.
Thanks for stating this. Dropping P3 would change my thinking about this topic, that is, if we drop P3, then I would prefer that no splitter functions are added to Data.List and that it is left as is.
The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"].
I agree with this comment when thinking about strings and what I would do most of the time and from a pragmatic point of view.
In this case it's really annoying to have to throw away the delimiters yourself, especially if you just get back a list like ["foo",";","bar",";","baz"] and have to decide which things are delimiters and which aren't,
I certainly understand this point, however,
with no help from the type system.
-- P5. The splitter functions should fit within the spirit of the -- Data.List module and even the original Haskell 98 List module -- in terms of type signature and complexity of implementation. In my mind, the idea of adding a few splitter functions to Data.List does not preclude Data.List.Split. From my perspective of P5 above, within the spirit of Data.List, you can work with things like a, [a], (a,b), Maybe a, Eq a, ... . If you wish to do more, a separate module would be in order as you have done.
But, as you noted, throwing away information like this is bad from an elegance/formal properties point of view. This is exactly why I designed the Data.List.Split library as I did: the core internal splitting function is information-preserving, and by using various combinators the user can choose to throw away whatever information they are not interested in.
Perfect. So, as you see it, are there one, two, or three functions in or hiding in Data.List.Split.Internals that can be factored and placed into Data.List that are in line with P1 to P7? You do not actually need to agree to P1 to P7, it is a conceptual exercise. The idea is that Data.List.Split would flow more naturally from Data.List with these few functions added to it. Finally, as concrete examples or to clarify points, the words split, delimiter, separator and variations thereof have been used. This already implies a theme. Do you conceptualize of Data.List.Split as primarily to help programmers from other backgrounds to be able to manipulate strings, that is, supply some nice idioms but generalized from [Char] to [a]? If I were to write organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])] could you think of a specification such that this function would be a work horse in implementing Data.List.Split.Internals and Data.List.Split? Alex had the point of view later in this thread that now that Data.List.Split exists, anything that we move to Data.List will be arbitrary in the cutoff. Duncan responded by advancing the idea that by examining what is happening in Haskell code, we may find a few useful functions for Data.List. My intermediate idea would be to examine Data.List.Split and Data.List.Split.Internals and think about factoring very general idioms that could be placed into Data.List, would be the work horse for implementing Data.List.Split.Internals and Data.List.Split, and would be in line with P1 to P7, which I acknowledge is my point of view. If a few words could be found that fit the above, they would merit Data.List. Finally, this whole thread brings up the question in my mind about module design. As work is put into Data.List.Split, what is the guiding principle that prevents it from becoming Data.List.Extensions or to be a bit more direct, Data.List.TheFunctionsThatWereForgotten? At http://haskell.org/haskellwiki/Data.List.Split, we have
An important caveat: we should strive to keep things flexible yet SIMPLE. The more complicated things get, the closer this gets to just being a general parsing or regex library. So the right balance needs to be struck.
I agree, and we have
A theoretical module which contains implementations/combinators for implementing every possible method of list-splitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all.
Is not this Data.List? In other words, what idea or theme does a new Haskell programmer use to decide to first look into Data.List as opposed to Data.List.Split and vice versa? Cheers, - Marcus -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

On Sun, 2009-01-18 at 12:02 +0100, Marcus D. Gabriel wrote:
Brent Yorgey wrote:
P2. There should be no information loss, that is, keep the delimiters, keep the separators, keep the parts of the original list xs that satisfy a predicate p, do not lose information about the beginning and the end of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this.
Thanks for stating this. Dropping P3 would change my thinking about this topic, that is, if we drop P3, then I would prefer that no splitter functions are added to Data.List and that it is left as is.
The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"].
I agree with this comment when thinking about strings and what I would do most of the time and from a pragmatic point of view.
Indeed, the existing Data.List.words is certainly lossy and deliberately so. It's also useful and widely used. On the other hand it is a widely held view that Data.List.lines should not be lossy, ie that Data.List.unlines . Data.List.lines should be the identity. In the current implementation of lines . unlines it is not the case because of the way it handles a trailing newline. Duncan

Duncan Coutts wrote:
On Sun, 2009-01-18 at 12:02 +0100, Marcus D. Gabriel wrote:
Brent Yorgey wrote:
P2. There should be no information loss, that is, keep the
delimiters,
keep the separators, keep the parts of the original list xs that
satisfy
a predicate p, do not lose information about the beginning and the
end
of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this.
Thanks for stating this. Dropping P3 would change my thinking about this topic, that is, if we drop P3, then I would prefer that no splitter functions are added to Data.List and that it is left as is.
The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"].
I agree with this comment when thinking about strings and what I would do most of the time and from a pragmatic point of view.
Indeed, the existing Data.List.words is certainly lossyand deliberately so. It's also useful and widely used.
On the other hand it is a widely held view that Data.List.lines should not be lossy, ie that Data.List.unlines . Data.List.lines should be the identity. In the current implementation of lines . unlines it is not the case because of the way it handles a trailing newline.
Duncan
An argument for not placing any fundamental splitter functions in Data.List that are lossy if I ever read one. The user of these functions should explicitly choose to lose information. Then the documentation in the Haskell 98 report might have stated instead something like unlines . lines == id iff xs ends with '\n' which would at least be up front. Cheers, - Marcus

On Sun, 2009-01-18 at 15:11 +0100, Marcus D. Gabriel wrote:
Duncan Coutts wrote:
On Sun, 2009-01-18 at 12:02 +0100, Marcus D. Gabriel wrote:
Brent Yorgey wrote:
P2. There should be no information loss, that is, keep the
delimiters,
keep the separators, keep the parts of the original list xs that
satisfy
a predicate p, do not lose information about the beginning and the
end
of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this.
Thanks for stating this. Dropping P3 would change my thinking about this topic, that is, if we drop P3, then I would prefer that no splitter functions are added to Data.List and that it is left as is.
The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"].
I agree with this comment when thinking about strings and what I would do most of the time and from a pragmatic point of view.
Indeed, the existing Data.List.words is certainly lossyand deliberately so. It's also useful and widely used.
On the other hand it is a widely held view that Data.List.lines should not be lossy, ie that Data.List.unlines . Data.List.lines should be the identity. In the current implementation of lines . unlines it is not the case because of the way it handles a trailing newline.
An argument for not placing any fundamental splitter functions in Data.List that are lossy if I ever read one.
The user of these functions should explicitly choose to lose information. Then the documentation in the Haskell 98 report might have stated instead something like
unlines . lines == id iff xs ends with '\n'
which would at least be up front.
Of course, the properties should have been specified. But are you also really saying that 'words' should have been omitted from the List module? Duncan

Duncan Coutts wrote:
On Sun, 2009-01-18 at 15:11 +0100, Marcus D. Gabriel wrote:
Duncan Coutts wrote:
On Sun, 2009-01-18 at 12:02 +0100, Marcus D. Gabriel wrote:
Brent Yorgey wrote:
P2. There should be no information loss, that is, keep the
delimiters,
keep the separators, keep the parts of the original list xs that
satisfy
a predicate p, do not lose information about the beginning and the
end
of the list relative to the first and last elements of the list respectively. The user of the function decides what to discard.
P3. A split list should be unsplittable so as to recover the original list xs. (I made up the word unsplittable.) (P2 implies P3, but let us state this anyway.)
I'm not sure I agree with this.
Thanks for stating this. Dropping P3 would change my thinking about this topic, that is, if we drop P3, then I would prefer that no splitter functions are added to Data.List and that it is left as is.
The problem is that much (most?) of the time, people looking for a split function want to discard delimiters; for example, if you have a string like "foo;bar;baz" and you want to split it into ["foo","bar","baz"].
I agree with this comment when thinking about strings and what I would do most of the time and from a pragmatic point of view.
Indeed, the existing Data.List.words is certainly lossyand deliberately so. It's also useful and widely used.
On the other hand it is a widely held view that Data.List.lines should not be lossy, ie that Data.List.unlines . Data.List.lines should be the identity. In the current implementation of lines . unlines it is not the case because of the way it handles a trailing newline.
An argument for not placing any fundamental splitter functions in Data.List that are lossy if I ever read one.
The user of these functions should explicitly choose to lose information. Then the documentation in the Haskell 98 report might have stated instead something like
unlines . lines == id iff xs ends with '\n'
which would at least be up front.
Of course, the properties should have been specified. But are you also really saying that 'words' should have been omitted from the List module?
Good catch, very good catch. Historically, it is there, so pragmatically, it stays. This go backs to my question about module, package, or library design. A set of Data.List design criteria could have been set up such that
unlines :: [String] -> String unwords :: [String] -> String
were not in List of Haskell 98 but in a smaller, specific module, something similar to Data.Char for example. I recall that when I was first learning Haskell and browsed List under Hugs, these two functions stood out compared to everything else simply because of their type signature. Well Duncan, you kind of undermined the perspective that I was trying to construct. You pushed my view to Alex's one about simply leaving Data.List alone and moving forward with Data.List.Split. Assuming this, then from a very pragmatic point of view, I would tell a new programmer to Haskell to study Data.List first, and if you do not find for what you are looking, try Data.List.Split. Still, I am curious what Brent will write. Cheers, - Marcus

On Sun, Jan 18, 2009 at 12:02:43PM +0100, Marcus D. Gabriel wrote:
But, as you noted, throwing away information like this is bad from an elegance/formal properties point of view. This is exactly why I designed the Data.List.Split library as I did: the core internal splitting function is information-preserving, and by using various combinators the user can choose to throw away whatever information they are not interested in.
Perfect. So, as you see it, are there one, two, or three functions in or hiding in Data.List.Split.Internals that can be factored and placed into Data.List that are in line with P1 to P7? You do not actually need to agree to P1 to P7, it is a conceptual exercise. The idea is that Data.List.Split would flow more naturally from Data.List with these few functions added to it.
Not really. You are welcome to look at the source of Data.List.Split.Internals yourself; you will see that the core functions on top of which everything else is implemented use various internal data types, to more accurately reflect the richness of information that is present in the most general case. If we were to add these core functions to Data.List.Split, we would have to add these new data types as well, which seems to go against the spirit of simplicity found in Data.List. And in any case, the core functions are not very easy or convenient to use on their own. Let me be clear---I am not against adding some splitting functions to Data.List, if consensus as to what these functions should be is reached. But literally pulling a few things out of Data.List.Split as you propose is not the way to go, as I think you will agree if you look at the source.
Finally, as concrete examples or to clarify points, the words split, delimiter, separator and variations thereof have been used. This already implies a theme. Do you conceptualize of Data.List.Split as primarily to help programmers from other backgrounds to be able to manipulate strings, that is, supply some nice idioms but generalized from [Char] to [a]?
No, I see it as a useful tool for manipulating lists in general.
If I were to write
organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
could you think of a specification such that this function would be a work horse in implementing Data.List.Split.Internals and Data.List.Split?
I could, but I think the result would be rather uglier and harder to understand than the current implementation.
Finally, this whole thread brings up the question in my mind about module design. As work is put into Data.List.Split, what is the guiding principle that prevents it from becoming Data.List.Extensions or to be a bit more direct, Data.List.TheFunctionsThatWereForgotten?
Are you serious? It seems quite clear to me that Data.List.Split will *not* turn into that. For one thing, it contains only list-splitting functions; for another thing, I do not foresee putting that much more work into it.
A theoretical module which contains implementations/combinators for implementing every possible method of list-splitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all.
Is not this Data.List? In other words, what idea or theme does a new Haskell programmer use to decide to first look into Data.List as opposed to Data.List.Split and vice versa?
I was being sort of facetious in that quote. =) -Brent

Brent Yorgey wrote:
Not really. You are welcome to look at the source of Data.List.Split.Internals yourself;
Fair enough, I will look at it.
Let me be clear---I am not against adding some splitting functions to Data.List, if consensus as to what these functions should be is reached.
Not to worry, this was clear to me. I am sorry if my thread obscured this.
If I were to write
organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
could you think of a specification such that this function would be a work horse in implementing Data.List.Split.Internals and Data.List.Split?
I could, but I think the result would be rather uglier and harder to understand than the current implementation.
I understand. I do not perceive that I have expressed very well the way that I am trying to conceptualise of this, but the observation of yours above makes me conclude that adding a few splitter functions to Data.List would just be arbitrary. (Alex already stated this point of view.)
Finally, this whole thread brings up the question in my mind about module design. As work is put into Data.List.Split, what is the guiding principle that prevents it from becoming Data.List.Extensions or to be a bit more direct, Data.List.TheFunctionsThatWereForgotten?
Are you serious?
Well, actually yes, I was serious, but see below.
It seems quite clear to me that Data.List.Split will *not* turn into that. For one thing, it contains only list-splitting functions; for another thing, I do not foresee putting that much more work into it.
Ok, I was on a tangent. Given this above, no, I was not serious.
A theoretical module which contains implementations/combinators for implementing every possible method of list-splitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all.
Is not this Data.List? In other words, what idea or theme does a new Haskell programmer use to decide to first look into Data.List as opposed to Data.List.Split and vice versa?
I was being sort of facetious in that quote. =)
Ok, once again, I was on a tangent and drifting off topic. In my defence, my parents-in-law were here for my Son's birthday which in Alsace means a lot of good food and wine which also means I should not write e-mails at night after such an event :). -- So, to conclude, when I first started using Haskell 98, at one point in time I was looking for some splitter idioms in the List module. The functions break and span were close, but not it. The functions
lines :: String -> [String] unlines :: [String] -> String
words :: String -> [String] unwords :: [String] -> String
seemed out of place compared to everything else in the List module just because of their type signature whereupon it seemed even more odd that I could not just split /etc/passwd or /etc/group on ':' with something directly suited to the task that was in the List module as an example. Thanks for the thread, - Marcus -- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name

Marcus D. Gabriel wrote:
If I were to write
organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
I quite like your idea, but I think the input predicate "([a] -> Bool)" is too ambitious, although it would nicely unify Brent's data Delimiter a where DelimEltPred :: (a -> Bool) -> Delimiter a DelimSublist :: Eq a => [a] -> Delimiter a With the predicate ([a] -> Bool) you have to check all inits of your input to detect a delimiter and only if all inits are no delimiter you know the head element is not part of a delimiter and repeat checking the inits of the tail. I think this is too inefficient in general. In order to keep things simple I would vote for a split function that only takes the simple predicate (a -> Bool) and leaves the more complicated splitting to _additional_ functions (following http://haskell.org/haskellwiki/Simple_to_complex) For instance replace :: [a] -> a -> [a] -> [a] could replace a sublist (first argument) with a single element (second argument) This would help in simple cases, but leaves it to the user to choose a suitable delimiter element. But a general splitting on sublists could be implemented via splitBy isNothing (replace (map Just sl) Nothing (map Just l)) (For a fixed sublist sl, "Nothing" is enough to represent the delimiter) For a simple predicate "(a -> Bool)" it remains to discuss the output of splitBy. You've proposed "[([a], [a])]", but for the simpler case "[(a, [a])]" or "[([a], a)]" may do, but in order to capture the full information of lists before and after delimiters, something like "([a], [(a, [a])])" is needed. Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]", "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a _tail_ of non-empty element lists. Therefore I propose the following splitBy as a "work horse". splitBy :: (a -> Bool) -> [a] -> [[a]] The implementation is simply using "break". The basic property is: concat (splitBy p l) == l (This is almost in the spirit of Data.List, since groupBy or words also produces non-empty element lists.) Getting rid of a final delimiter or all of them can be implemented in O(n) (see below). And finally: wordsBy p = filter (not . null) . dropDelims . splitBy p linesBy p = dropDelims . dropFinalDelim . splitBy p with: words s == wordsBy isSpace s lines s == linesBy (== '\n') s Surely, one might prefer a more type-safe version wrt non-empty lists, but the main idea is to use function composition rather than composing a (non-haskell98) splitter input data type as Brent does in his split package. Cheers Christian import Data.Char (isSpace) splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p l = let (fr, rt) = break p l in case rt of [] -> [fr] d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll dropFinalDelim :: [[a]] -> [[a]] dropFinalDelim ll@(l : ls) = if null ls then if null l then [] else ll else if null (tail (last ls)) then init ll else ll dropDelims :: [[a]] -> [[a]] dropDelims ll = case ll of [] -> [] l : ls -> l : map tail ls wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p = filter (not . null) . dropDelims . splitBy p linesBy :: (a -> Bool) -> [a] -> [[a]] linesBy p = dropDelims . dropFinalDelim . splitBy p prop_wordsBy :: String -> Bool prop_wordsBy s = words s == wordsBy isSpace s prop_linesBy :: String -> Bool prop_linesBy s = lines s == linesBy (== '\n') s replace :: Eq a => [a] -> a -> [a] -> [a] replace sl@(_ : _) r l = case l of [] -> l x : xs -> case stripPrefix sl l of Nothing -> x : replace sl r xs Just rt -> r : replace sl r rt subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]] subListSplit sl@(_ : _) l = splitBy isNothing (replace (map Just sl) Nothing (map Just l)) unintercalate :: Eq a => [a] -> [a] -> [[a]] unintercalate sl@(_ : _) = map (map fromJust) . dropDelims . subListSplit sl

Christian Maeder wrote:
Marcus D. Gabriel wrote:
If I were to write
organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
I quite like your idea,
Thanks.
but I think the input predicate "([a] -> Bool)" is too ambitious, although it would nicely unify Brent's
data Delimiter a where DelimEltPred :: (a -> Bool) -> Delimiter a DelimSublist :: Eq a => [a] -> Delimiter a
With the predicate ([a] -> Bool) you have to check all inits of your input to detect a delimiter and only if all inits are no delimiter you know the head element is not part of a delimiter and repeat checking the inits of the tail. I think this is too inefficient in general.
The name "organizeBy" was a metaphor. I was trying to distance my language from words such as split, delimiter, and separator to something more general. As for the use of inits, yes, this is a performance hit. If you use something like
splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
to split on a list such as "\r\n", then you can use isPrefixOf whereupon the performance is good enough (actually, its not bad at all). (http://www.haskell.org/pipermail/libraries/2009-January/011061.html) In retrospect, the idea of splitting, delimiters, separators, and fields is what this is all about anyway.
In order to keep things simple I would vote for a split function that only takes the simple predicate (a -> Bool) and leaves the more complicated splitting to _additional_ functions (following http://haskell.org/haskellwiki/Simple_to_complex)
For instance
replace :: [a] -> a -> [a] -> [a]
could replace a sublist (first argument) with a single element (second argument)
This would help in simple cases, but leaves it to the user to choose a suitable delimiter element.
But a general splitting on sublists could be implemented via
splitBy isNothing (replace (map Just sl) Nothing (map Just l))
(For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
Nice. It took me a moment, but nice.
For a simple predicate "(a -> Bool)" it remains to discuss the output of splitBy.
You've proposed "[([a], [a])]", but for the simpler case "[(a, [a])]" or "[([a], a)]" may do,
Actually, although enticing, I do not believe that [(a, [a])] is possible due to the corner cases when there is no beginning non-delimiter or ending delimiter, that is, one needs [(Maybe a, [a])]. (Please check me on this.)
but in order to capture the full information of lists before and after delimiters, something like "([a], [(a, [a])])" is needed.
Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]", "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a _tail_ of non-empty element lists.
I unfortunately do not follow you here, sorry. Be that as it may, I have come to appreciate the output [[a]].
Therefore I propose the following splitBy as a "work horse".
splitBy :: (a -> Bool) -> [a] -> [[a]]
This works for me.
The implementation is simply using "break". The basic property is:
concat (splitBy p l) == l
I would consider this a requirement, so this works for me.
(This is almost in the spirit of Data.List, since groupBy or words also produces non-empty element lists.)
Getting rid of a final delimiter or all of them can be implemented in O(n) (see below).
And finally:
wordsBy p = filter (not . null) . dropDelims . splitBy p linesBy p = dropDelims . dropFinalDelim . splitBy p
with:
words s == wordsBy isSpace s lines s == linesBy (== '\n') s
Surely, one might prefer a more type-safe version wrt non-empty lists, but the main idea is to use function composition rather than composing a (non-haskell98) splitter input data type as Brent does in his split package.
I am aligned with your point of view above in the sense of additions to Data.List. The more that I think about and play with your definition of the function replace, the more I like it. It captures an idiom that I have used on the UNIX command line when hacking away on text, that is, replace sequences with an unambiguous marker for later use. So, in summary, your idea would be to introduce two functions into Data.List:
splitBy :: (a -> Bool) -> [a] -> [[a]] replace :: Eq a => [a] -> a -> [a] -> [a]
Is this correct? If so, how would you define
splitOnAList :: Eq a => [a] -> [a] -> [[a]]
using splitBy and replace. For example,
splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]
Given this, maybe we should look at all of the many ways that software makes CSV (TSV) files and see if splitBy and replace can reasonably handle the task. Actually, can even Data.List.Split reasonably handle the task? (I only just recalled this common little problem that is almost trivial but never really so.) Cheers, - Marcus
Cheers Christian
import Data.Char (isSpace)
splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p l = let (fr, rt) = break p l in case rt of [] -> [fr] d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
dropFinalDelim :: [[a]] -> [[a]] dropFinalDelim ll@(l : ls) = if null ls then if null l then [] else ll else if null (tail (last ls)) then init ll else ll
dropDelims :: [[a]] -> [[a]] dropDelims ll = case ll of [] -> [] l : ls -> l : map tail ls
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p = filter (not . null) . dropDelims . splitBy p
linesBy :: (a -> Bool) -> [a] -> [[a]] linesBy p = dropDelims . dropFinalDelim . splitBy p
prop_wordsBy :: String -> Bool prop_wordsBy s = words s == wordsBy isSpace s
prop_linesBy :: String -> Bool prop_linesBy s = lines s == linesBy (== '\n') s
replace :: Eq a => [a] -> a -> [a] -> [a] replace sl@(_ : _) r l = case l of [] -> l x : xs -> case stripPrefix sl l of Nothing -> x : replace sl r xs Just rt -> r : replace sl r rt
subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]] subListSplit sl@(_ : _) l = splitBy isNothing (replace (map Just sl) Nothing (map Just l))
unintercalate :: Eq a => [a] -> [a] -> [[a]] unintercalate sl@(_ : _) = map (map fromJust) . dropDelims . subListSplit sl
-- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name

Marcus D. Gabriel wrote:
Christian Maeder wrote: [...]
splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
to split on a list such as "\r\n", then you can use isPrefixOf whereupon the performance is good enough (actually, its not bad at all).
The special case for "\r\n" is actually trivial, because "\r" can simply be filtered out first.
But a general splitting on sublists could be implemented via
splitBy isNothing (replace (map Just sl) Nothing (map Just l))
(For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
Nice. It took me a moment, but nice.
Good, see below my definition of subListSplit and unintercalate.
For a simple predicate "(a -> Bool)" it remains to discuss the output of splitBy.
You've proposed "[([a], [a])]", but for the simpler case "[(a, [a])]" or "[([a], a)]" may do,
Actually, although enticing, I do not believe that [(a, [a])] is possible due to the corner cases when there is no beginning non-delimiter or ending delimiter, that is, one needs [(Maybe a, [a])]. (Please check me on this.)
You're right here, there are several ways to accommodate all delimiters and non-delimiters. I've done it as below.
but in order to capture the full information of lists before and after delimiters, something like "([a], [(a, [a])])" is needed.
Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]", "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a _tail_ of non-empty element lists.
I unfortunately do not follow you here, sorry. Be that as it may, I have come to appreciate the output [[a]].
In "([a], [(a, [a])])" the first component takes the (possibly empty) part before the first delimiter. The second part of type [(a, [a])] are the remaining delimiter with (possibly empty or longer) non-delimiter pairs. Such pairs are then viewed as non-empty lists. After changing the second component to [[a]] the resulting pair ([a], [[a]]) is then also changed to a non-empty list of lists [[a]].
wordsBy p = filter (not . null) . dropDelims . splitBy p linesBy p = dropDelims . dropFinalDelim . splitBy p
I'ld like to improve linesBy as follows: linesBy p = dropFinalNil . dropDelims . splitBy p (dropFinalNil is simpler than dropFinalDelim and dropDelims can assume a non-empty list from splitBy.)
So, in summary, your idea would be to introduce two functions into Data.List:
splitBy :: (a -> Bool) -> [a] -> [[a]] replace :: Eq a => [a] -> a -> [a] -> [a]
Is this correct?
Personally, I'ld be content with wordsBy only, but adding linesBy, splitBy and the combination of "dropDelims . splitBy p" (under some suitable names) would make sense for me with or without replace. (In fact, replace should be generalized further.)
If so, how would you define
splitOnAList :: Eq a => [a] -> [a] -> [[a]]
using splitBy and replace. For example,
splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]
Again, for "\r\n" this is simply: concatMap (: ["\r\n"]) . linesBy (== '\n') . filter (/= '\r') For the general case take my unintercalate below: splitOnAList sl = intercalate [sl] . map (: []) . unintercalate sl This code only puts back identical delimiters that nobody needs because the delimiter is fixed and known via the input argument. For sublist matching, keeping delimiters is unnecessary in general!
reasonably handle the task. Actually, can even Data.List.Split reasonably handle the task? (I only just recalled this common little problem that is almost trivial but never really so.)
Data.List.Split can handle all these tasks (and more), too, only less elegant (I think).
import Data.Char (isSpace)
splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p l = let (fr, rt) = break p l in case rt of [] -> [fr] d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
dropFinalDelim :: [[a]] -> [[a]]
forget dropFinalDelim one and take: dropFinalNil :: [[a]] -> [[a]] dropFinalNil ll@(_ : _) = if null (last ll) then init ll else ll
dropDelims :: [[a]] -> [[a]]
dropDelims does no longer need to work for empty inputs:
dropDelims ll = case ll of [] -> [] l : ls -> l : map tail ls
dropDelims (l : ls) = l : map tail ls but a total variant makes also sense: dropDelims ll = let (ft, rt) = splitAt 1 ll in ft ++ map (drop 1) rt
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p = filter (not . null) . dropDelims . splitBy p
linesBy :: (a -> Bool) -> [a] -> [[a]]
change linesBy to: linesBy p = dropFinalNil . dropDelims . splitBy p
replace :: Eq a => [a] -> a -> [a] -> [a] replace sl@(_ : _) r l = case l of [] -> l x : xs -> case stripPrefix sl l of Nothing -> x : replace sl r xs Just rt -> r : replace sl r rt
subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]] subListSplit sl@(_ : _) l = splitBy isNothing (replace (map Just sl) Nothing (map Just l))
unintercalate :: Eq a => [a] -> [a] -> [[a]] unintercalate sl@(_ : _) = map (map fromJust) . dropDelims . subListSplit sl
unintercalate can also be simplified to: unintercalate sl@(_ : _) = map catMaybes . subListSplit sl Cheers Christian

Thanks for the clarification. I would not have guessed wordsBy and linesBy as your preference over splitBy for additions to Data.List. I do not have an opinion about whether or not functional composition (pipelining) is more elegant than Brent's Splitter data type. I just know that I am very comfortable with constructs such as (f . g . h) or longer. By the way, out or curiosity, how would you generalize your current function replace? Cheers, - Marcus Christian Maeder wrote:
Marcus D. Gabriel wrote:
Christian Maeder wrote:
[...]
splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
to split on a list such as "\r\n", then you can use isPrefixOf whereupon the performance is good enough (actually, its not bad at all).
The special case for "\r\n" is actually trivial, because "\r" can simply be filtered out first.
But a general splitting on sublists could be implemented via
splitBy isNothing (replace (map Just sl) Nothing (map Just l))
(For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
Nice. It took me a moment, but nice.
Good, see below my definition of subListSplit and unintercalate.
For a simple predicate "(a -> Bool)" it remains to discuss the output of splitBy.
You've proposed "[([a], [a])]", but for the simpler case "[(a, [a])]" or "[([a], a)]" may do,
Actually, although enticing, I do not believe that [(a, [a])] is possible due to the corner cases when there is no beginning non-delimiter or ending delimiter, that is, one needs [(Maybe a, [a])]. (Please check me on this.)
You're right here, there are several ways to accommodate all delimiters and non-delimiters. I've done it as below.
but in order to capture the full information of lists before and after delimiters, something like "([a], [(a, [a])])" is needed.
Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]", "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a _tail_ of non-empty element lists.
I unfortunately do not follow you here, sorry. Be that as it may, I have come to appreciate the output [[a]].
In "([a], [(a, [a])])" the first component takes the (possibly empty) part before the first delimiter. The second part of type [(a, [a])] are the remaining delimiter with (possibly empty or longer) non-delimiter pairs. Such pairs are then viewed as non-empty lists. After changing the second component to [[a]] the resulting pair ([a], [[a]]) is then also changed to a non-empty list of lists [[a]].
wordsBy p = filter (not . null) . dropDelims . splitBy p linesBy p = dropDelims . dropFinalDelim . splitBy p
I'ld like to improve linesBy as follows:
linesBy p = dropFinalNil . dropDelims . splitBy p
(dropFinalNil is simpler than dropFinalDelim and dropDelims can assume a non-empty list from splitBy.)
So, in summary, your idea would be to introduce two functions into Data.List:
splitBy :: (a -> Bool) -> [a] -> [[a]] replace :: Eq a => [a] -> a -> [a] -> [a]
Is this correct?
Personally, I'ld be content with wordsBy only, but adding linesBy, splitBy and the combination of "dropDelims . splitBy p" (under some suitable names) would make sense for me with or without replace. (In fact, replace should be generalized further.)
If so, how would you define
splitOnAList :: Eq a => [a] -> [a] -> [[a]]
using splitBy and replace. For example,
splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]
Again, for "\r\n" this is simply:
concatMap (: ["\r\n"]) . linesBy (== '\n') . filter (/= '\r')
For the general case take my unintercalate below:
splitOnAList sl = intercalate [sl] . map (: []) . unintercalate sl
This code only puts back identical delimiters that nobody needs because the delimiter is fixed and known via the input argument.
For sublist matching, keeping delimiters is unnecessary in general!
reasonably handle the task. Actually, can even Data.List.Split reasonably handle the task? (I only just recalled this common little problem that is almost trivial but never really so.)
Data.List.Split can handle all these tasks (and more), too, only less elegant (I think).
import Data.Char (isSpace)
splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p l = let (fr, rt) = break p l in case rt of [] -> [fr] d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
dropFinalDelim :: [[a]] -> [[a]]
forget dropFinalDelim one and take:
dropFinalNil :: [[a]] -> [[a]] dropFinalNil ll@(_ : _) = if null (last ll) then init ll else ll
dropDelims :: [[a]] -> [[a]]
dropDelims does no longer need to work for empty inputs:
dropDelims ll = case ll of [] -> [] l : ls -> l : map tail ls
dropDelims (l : ls) = l : map tail ls
but a total variant makes also sense:
dropDelims ll = let (ft, rt) = splitAt 1 ll in ft ++ map (drop 1) rt
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p = filter (not . null) . dropDelims . splitBy p
linesBy :: (a -> Bool) -> [a] -> [[a]]
change linesBy to:
linesBy p = dropFinalNil . dropDelims . splitBy p
replace :: Eq a => [a] -> a -> [a] -> [a] replace sl@(_ : _) r l = case l of [] -> l x : xs -> case stripPrefix sl l of Nothing -> x : replace sl r xs Just rt -> r : replace sl r rt
subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]] subListSplit sl@(_ : _) l = splitBy isNothing (replace (map Just sl) Nothing (map Just l))
unintercalate :: Eq a => [a] -> [a] -> [[a]] unintercalate sl@(_ : _) = map (map fromJust) . dropDelims . subListSplit sl
unintercalate can also be simplified to:
unintercalate sl@(_ : _) = map catMaybes . subListSplit sl
Cheers Christian
-- Marcus D. Gabriel, Ph.D. Saint Louis, FRANCE http://www.marcus.gabriel.name mailto:marcus@gabriel.name Tel: +33.3.89.69.05.06 Portable: +33.6.34.56.07.75

Marcus D. Gabriel wrote:
By the way, out or curiosity, how would you generalize your current function replace?
I would define: replaceBy :: ([a] -> ([b], [a])) -> [a] -> [b] replaceBy splt l = case l of [] -> [] _ -> let (ft, rt) = splt l in ft ++ replaceBy splt rt The first argument is a kind of splitting function that expects a non-empty input list to cut of a first part and change this part to the a replacement list of type [b]. The second part must be the rest of the input to be processed. The original replace function can then be defined via: replace :: Eq a => [a] -> a -> [a] -> [a] replace sl@(_ : _) r = replaceBy $ \ l@(hd : tl) -> case stripPrefix sl l of Nothing -> ([hd], tl) Just rt -> ([r], rt) Other replacements are also possible, i.e. replaceDelim :: ([a] -> ([a], [a])) -> [a] -> [Either [a] a] replaceDelim splt = replaceBy $ \ l@(hd : tl) -> let (ft, rt) = splt l in if null ft then ([Right hd], tl) else ([Left ft], rt) For replaceDelim consider an input function like "span isSpace". A sequence of spaces is grouped to a "Left String", whereas other characters become "Right". The resulting list is suitable for my original splitBy function yielding a list of type "[[Either [a] a]]". This result can be further flattened to [[a]] (by the function flatEitherSplits below). splitByDelim :: ([a] -> ([a], [a])) -> [a] -> [[a]] splitByDelim split = flatEitherSplits . splitBy (either (const True) (const False)) . replaceDelim split As a concrete example consider: splitBySpaces :: String -> [String] splitBySpaces = splitByDelim (span isSpace) splitBySpaces "ab c def" == ["ab"," ","c"," ","def"] For the sake of completeness I add the definition of flatEitherSplits that exploits the result structure of splitBy and uses auxiliary functions: flatRight :: [Either [a] a] -> [a] flatRight = map (either (error "flatRight") id) flatEither :: [Either [a] a] -> [[a]] flatEither (Left d : rt) = d : [flatRight rt] flatEitherSplits :: [[Either [a] a]] -> [[a]] flatEitherSplits (h : tl) = flatRight h : concatMap flatEither tl Cheers Christian P.S.
For the general case take my unintercalate below:
splitOnAList sl = intercalate [sl] . map (: []) . unintercalate sl
By the way, this can be expressed simpler: splitOnAList sl = intersperse sl . unintercalate sl
participants (7)
-
Alexander Dunlap
-
Brandon S. Allbery KF8NH
-
Brent Yorgey
-
Christian Maeder
-
Duncan Coutts
-
Henning Thielemann
-
Marcus D. Gabriel