Re: Proposal for Data.List.splitBy

8 Feb 2009

      Christian Maeder wrote:
...
Marcus D. Gabriel wrote:
...
If I were to write
organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
I quite like your idea,
Thanks.
...
but I think the input predicate "([a] -> Bool)" is too ambitious, although it would nicely unify Brent's
data Delimiter a where
  DelimEltPred :: (a -> Bool) -> Delimiter a
  DelimSublist :: Eq a => [a] -> Delimiter a
With the predicate ([a] -> Bool) you have to check all inits of your
input to detect a delimiter and only if all inits are no delimiter you
know the head element is not part of a delimiter and repeat checking the
inits of the tail. I think this is too inefficient in general.
The name "organizeBy" was a metaphor.  I was trying to distance my
language from words such as split, delimiter, and separator to
something more general.  As for the use of inits, yes, this is a
performance hit.  If you use something like
...
splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
to split on a list such as "\r\n", then you can use isPrefixOf
whereupon the performance is good enough (actually, its not bad
at all).

(http://www.haskell.org/pipermail/libraries/2009-January/011061.html)

In retrospect, the idea of splitting, delimiters, separators, and
fields is what this is all about anyway.
...
In order to keep things simple I would vote for a split function that
only takes the simple predicate (a -> Bool) and leaves the more
complicated splitting to _additional_ functions
(following http://haskell.org/haskellwiki/Simple_to_complex)
For instance
replace :: [a] -> a -> [a] -> [a]
could replace a sublist (first argument) with a single element (second
argument)
This would help in simple cases, but leaves it to the user to choose a
suitable delimiter element.
But a general splitting on sublists could be implemented via
splitBy isNothing (replace (map Just sl) Nothing (map Just l))
(For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
Nice.  It took me a moment, but nice.
...
For a simple predicate "(a -> Bool)" it remains to discuss the output of
splitBy.
You've proposed "[([a], [a])]", but for the simpler case
"[(a, [a])]" or "[([a], a)]" may do,
Actually, although enticing, I do not believe that [(a, [a])] is
possible due to the corner cases when there is no beginning
non-delimiter or ending delimiter, that is, one needs
[(Maybe a, [a])].  (Please check me on this.)
...
but in order to capture the full
information of lists before and after delimiters, something like
"([a], [(a, [a])])" is needed.
Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]",
"([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a
_tail_ of non-empty element lists.
I unfortunately do not follow you here, sorry.  Be that as it may,
I have come to appreciate the output [[a]].
...
Therefore I propose the following
splitBy as a "work horse".
splitBy :: (a -> Bool) -> [a] -> [[a]]
This works for me.
...
The implementation is simply using "break". The basic property is:
concat (splitBy p l) == l
I would consider this a requirement, so this works for me.
...
(This is almost in the spirit of Data.List, since groupBy or words also
produces non-empty element lists.)
Getting rid of a final delimiter or all of them can be implemented in
O(n) (see below).
And finally:
wordsBy p = filter (not . null) . dropDelims . splitBy p
  linesBy p = dropDelims . dropFinalDelim . splitBy p
with:
words s == wordsBy isSpace s
  lines s == linesBy (== '\n') s
Surely, one might prefer a more type-safe version wrt non-empty lists,
but the main idea is to use function composition rather than composing a
(non-haskell98) splitter input data type as Brent does in his split package.
I am aligned with your point of view above in the sense of
additions to Data.List.

The more that I think about and play with your definition of the
function replace, the more I like it.  It captures an idiom that
I have used on the UNIX command line when hacking away on text, that is,
replace sequences with an unambiguous marker for later use.

So, in summary, your idea would be to introduce two functions into
Data.List:
...
splitBy :: (a -> Bool) -> [a] -> [[a]]
replace :: Eq a => [a] -> a -> [a] -> [a]
Is this correct?  If so, how would you define
...
splitOnAList :: Eq a => [a] -> [a] -> [[a]]
using splitBy and replace.  For example,
...
splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]
Given this, maybe we should look at all of the many ways that
software makes CSV (TSV) files and see if splitBy and replace can
reasonably handle the task.  Actually, can even Data.List.Split
reasonably handle the task?  (I only just recalled this common
little problem that is almost trivial but never really so.)

Cheers,
- Marcus
...
Cheers Christian
import Data.Char (isSpace)
splitBy :: (a -> Bool) -> [a] -> [[a]]
splitBy p l = let (fr, rt) = break p l in case rt of
  [] -> [fr]
  d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
dropFinalDelim :: [[a]] -> [[a]]
dropFinalDelim ll@(l : ls) =
  if null ls then
      if null l then [] else ll
  else
      if null (tail (last ls)) then init ll else ll
dropDelims :: [[a]] -> [[a]]
dropDelims ll = case ll of
  [] -> []
  l : ls -> l : map tail ls
wordsBy :: (a -> Bool) -> [a] -> [[a]]
wordsBy p = filter (not . null) . dropDelims . splitBy p
linesBy :: (a -> Bool) -> [a] -> [[a]]
linesBy p = dropDelims . dropFinalDelim . splitBy p
prop_wordsBy :: String -> Bool
prop_wordsBy s = words s == wordsBy isSpace s
prop_linesBy :: String -> Bool
prop_linesBy s = lines s == linesBy (== '\n') s
replace :: Eq a => [a] -> a -> [a] -> [a]
replace sl@(_ : _) r l = case l of
  [] -> l
  x : xs -> case stripPrefix sl l of
    Nothing -> x : replace sl r xs
    Just rt -> r : replace sl r rt
subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]]
subListSplit sl@(_ : _) l =
  splitBy isNothing (replace (map Just sl) Nothing (map Just l))
unintercalate :: Eq a => [a] -> [a] -> [[a]]
unintercalate sl@(_ : _) =
  map (map fromJust) . dropDelims . subListSplit sl
-- 
  Marcus D. Gabriel, Ph.D.                         Saint Louis, FRANCE
  http://www.marcus.gabriel.name            mailto:marcus@gabriel.name