wordsBy in the base libraries?

Hello all, What do you think about having a wordsBy function in the standard libraries? It often comes in handy.
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] ':rest -> (s':w) : wordsBy p s'' where (w, s'') = break p rest
This version takes care of avoiding a redundant character check that is present in the standard definition of the words function, originally discovered by Neil Mitchell, see his blog post here: http://neilmitchell.blogspot.com/2007/07/equational-reasoning-in-haskell.htm... Then we can define the words function naturally:
words :: String -> [String] words = wordsBy isSpace
Cheers, Maxime

Hi We certainly need a function to split a list into sections. Whether it be wordsBy, or something like linesBy, or some new split variant.
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] s':rest -> (s':w) : wordsBy p s'' where (w, s'') = break p rest
You are still over by one test. Try instead: wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] s':rest -> (s':w) : wordsBy p (drop 1 s'') where (w, s'') = break p rest The supero paper (http://www-users.cs.york.ac.uk/~ndm/supero/) has a very short explanation of this. Thanks Neil

Neil Mitchell wrote:
You are still over by one test. Try instead:
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] s':rest -> (s':w) : wordsBy p (drop 1 s'') where (w, s'') = break p rest
This still has the redundant empty list tests, and in fact introduces another one in drop 1, as well as some gratuitous arithmetic. -Yitz

Hi
You are still over by one test. Try instead:
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] s':rest -> (s':w) : wordsBy p (drop 1 s'') where (w, s'') = break p rest
This still has the redundant empty list tests,
Perhaps. Something like SpecConstr can remove them. Also remember that p may be arbitrarily expensive, while a test for an empty list is cheap. In the particular case of words, p (i.e. isSpace) is very expensive.
and in fact introduces another one in drop 1, as well as some gratuitous arithmetic.
The actual version I use is drop1, where drop1 is defined as: drop1 [] = [] drop1 (x:xs) = xs Also known as safeTail in the "Safe" library. Thanks Neil

How about this: wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p (x:xs) = if p x then wordsBy p xs else let (w, t) = getNextWord xs in (x:w) : t where getNextWord (x:xs) = if p x then ([], wordsBy p xs) else let (w, t) = getNextWord xs in (x:w, t) getNextWord _ = ([], []) wordsBy _ _ = [] (yuck) -Yitz

On 22 Oct 2007, at 10:50 pm, Neil Mitchell wrote:
wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy p s = case dropWhile p s of [] -> [] s':rest -> (s':w) : wordsBy p (drop 1 s'') where (w, s'') = break p rest
For things like this, I think it may be easier to just write down a finite state automaton, so that it is *obvious* that each character is fetched and tested exactly once. wordsBy :: (a -> Bool) -> [a] -> [[a]] wordsBy sep str = s_skip str where s_skip [] = [] s_skip (c:cs) = if sep c then s_skip cs else s_word cs [c] s_word [] w = [reverse w] s_word (c:cs) w = if sep c then reverse w : s_skip cs else s_word cs (c:w) -- s_skip state: not in a word, skipping separators -- s_word state: in a word, accumulating "letters" Of course the s_skip state corresponds to the dropWhile call, but is there any guarantee about how often dropWhile or break do their tests? The FSA approach remains easy to use in more complex cases where dropWhile and break are awkward.

Maxime Henrion wrote:
This version takes care of avoiding a redundant character check
It still has several redundant checks: s'' is known to begin with space if it is non-empty, but that is checked again in the recursive call of wordsBy. In fact, break already checked whether s'' is empty, but wordsBy checks that again, too. Also - the pattern match in the case checks whether its list is empty, but that was already checked by dropWhile. If you want to guarantee that level of optimization, you probably have to inline everything by hand and not use any library functions at all. I wonder how much of that is already done by the compiler, though. -Yitz

Hi
If you want to guarantee that level of optimization, you probably have to inline everything by hand and not use any library functions at all. I wonder how much of that is already done by the compiler, though.
A reasonable level of optimisation for this function would be that the predicate is invoked at most once on each character. Since its easy to do, it would be a shame not to. The other things are all "small", compared to the predicate whose cost is unknown. Thanks Neil

Neil Mitchell wrote:
...Also remember that p may be arbitrarily expensive, while a test for an empty list is cheap. In the particular case of words, p (i.e. isSpace) is very expensive... A reasonable level of optimisation for this function would be that the predicate is invoked at most once on each character. Since its easy to do, it would be a shame not to. The other things are all "small", compared to the predicate whose cost is unknown.
Yes, that makes sense. That is the main issue. Though here it's not much harder to optimize away the extra empty list tests, also. Thanks, Yitz

Maxime Henrion wrote:
Hello all,
What do you think about having a wordsBy function in the standard libraries? It often comes in handy.
I speculate (but don't know) that the reason we don't have one is that there are quite a few choices to make: * delimiter as function (Char -> Bool) or Char * delimit by single chars only or multiple (String -> Bool) * multiple adjacent delimiters cause empty items, or not? * exact delimiter which occured returned as part of the result? ... my anecdotal experience is that in practice for slightly different applications you need different permutations of these choices. A single function which encompassed all the options would have a cumbersome type. Jules

Jules Bean wrote:
Maxime Henrion wrote:
Hello all,
What do you think about having a wordsBy function in the standard libraries? It often comes in handy.
I speculate (but don't know) that the reason we don't have one is that there are quite a few choices to make:
* delimiter as function (Char -> Bool) or Char * delimit by single chars only or multiple (String -> Bool) * multiple adjacent delimiters cause empty items, or not? * exact delimiter which occured returned as part of the result?
... my anecdotal experience is that in practice for slightly different applications you need different permutations of these choices. A single function which encompassed all the options would have a cumbersome type.
I fully agree with you in that we often need slight variations of such a function, and that writing one that encompasses all different uses would probably be less convenient to use. However, we could argue just the same about the "words" function. Since it's there, and that it's going to stay, I still find the "wordsBy" function nice to have in the standard libraries. Besides, it seems to me we often need that particular behaviour of words (single character matches, "eating" empty items). Cheers, Maxime
participants (5)
-
Jules Bean
-
Maxime Henrion
-
Neil Mitchell
-
Richard A. O'Keefe
-
Yitzchak Gale