RE: Prelude function suggestions

On 27 July 2004 23:44, ariep@xs4all.nl wrote:
{- | Split the list at the occurrences of a separator into sub-list. This is a generalization of 'words'. -} chop :: (a -> Bool) -> [a] -> [[a]] chop p s = let (l, s') = break p s in l : case s' of [] -> [] (_:rest) -> chop p rest
I like 'chop'. It belongs in Data.List, I'd say.
We have something very like this inside GHC, except we call it 'split'. It's very useful for splitting up file paths into components. For comparison, Perl calls it 'split'. In Perl, 'chop' does something different. Python also has a 'split' function that does something similar. Does anyone object to calling it 'split'? Cheers, Simon

simonmar:
On 27 July 2004 23:44, ariep@xs4all.nl wrote:
{- | Split the list at the occurrences of a separator into sub-list. This is a generalization of 'words'. -} chop :: (a -> Bool) -> [a] -> [[a]] chop p s = let (l, s') = break p s in l : case s' of [] -> [] (_:rest) -> chop p rest
I like 'chop'. It belongs in Data.List, I'd say.
We have something very like this inside GHC, except we call it 'split'. It's very useful for splitting up file paths into components.
For comparison, Perl calls it 'split'. In Perl, 'chop' does something different. Python also has a 'split' function that does something similar.
Does anyone object to calling it 'split'?
I often have the need to write this function. Sometimes it gets called split, sometimes psplit (p for perl) :) -- Don

{- | Split the list at the occurrences of a separator into sub-list. This is a generalization of 'words'. -}
There should be two versions of this, one which treats repeated delimiters the same as single ones and ignores initial and final delimiters, and one which treats repeated delimiters as creating an empty field. Thus
split isSpace " foo bar baz " = ["foo","bar","baz"]
split' isSpace " foo bar baz " = ["","foo","","bar","baz",""]
--KW 8-)
--
Keith Wansbrough

split isSpace " foo bar baz " = ["foo","bar","baz"] split' isSpace " foo bar baz " = ["","foo","","bar","baz",""]
split can be obtained from split' via composition with "filter (not . null)". Therefore the second version (split') is more important. The name "chop" confused me in the first place (as I thought only the last element should be chopped off) The name "split" is a bit too general. (compared with "intersperse" that somehow computes the opposite.) The haddock library docs http://www.haskell.org/ghc/docs/latest/html/libraries/doc-index-S.html show: split 1 (Function) GHC.Exts 2 (Function) Language.Haskell.THSyntax 3 (Function) System.Random, Random splitAt Data.List, GHC.List, Prelude, List splitAtPS Data.PackedString splitPS Data.PackedString splitWithPS Data.PackedString According to PackedString the above function split' should be named "splitWith" since it has a predicate argument. (So we do not need to conform to PackedString.) Btw "splitWithPS" and "splitPS" are wrong as they ignore a (single) final blank! splitWithPS isSpace $ packString " foo bar baz " = ["","foo","","bar","baz"] Maybe "splitToLists" or "splitUp" (or "splitUpOn") are better names Christian

W liście z śro, 28-07-2004, godz. 10:44 +0100, Keith Wansbrough napisał:
There should be two versions of this, one which treats repeated delimiters the same as single ones and ignores initial and final delimiters, and one which treats repeated delimiters as creating an empty field. Thus
split isSpace " foo bar baz " = ["foo","bar","baz"] split' isSpace " foo bar baz " = ["","foo","","bar","baz",""]
I just wanted to name such functions for my language and they are called Split and Split1. I like the name split, as it's used in Python, Perl and Ruby; I don't like the name split1 but don't have anything better. Python interprets the terminating separators as separators and never returns an empty list, while Perl and Ruby ignore them: [qrczak ~]$ python -c 'print "aabbcc".split("a")' ['', '', 'bbcc'] [qrczak ~]$ python -c 'print "aabbcc".split("c")' ['aabb', '', ''] [qrczak ~]$ perl -e 'print join(",", split(/a/, "aabbcc")), "\n"' ,,bbcc [qrczak ~]$ perl -e 'print join(",", split(/c/, "aabbcc")), "\n"' aabb [qrczak ~]$ ruby -e 'p "aabbcc".split("a")' ["", "", "bbcc"] [qrczak ~]$ ruby -e 'p "aabbcc".split("c")' ["aabb"] I prefer the Python semantics. The other can be obtained by removing trailing empty elements from the result, for which there should be an equivalent of dropWhile working from the other end... -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

W liście z śro, 28-07-2004, godz. 14:11 +0200, Marcin 'Qrczak' Kowalczyk napisał:
I just wanted to name such functions for my language and they are called Split and Split1. I like the name split, as it's used in Python, Perl and Ruby; I don't like the name split1 but don't have anything better.
Or what about this: splitSep - separators are between splitEnd - separators are between, with an optional separator at the end splitSeps - separators are between, many separators are permitted splitEnd is a generalization of lines, it's equivalent to splitSep with the last element removed if it's empty; splitSeps is a generalization of words, it's equivalent to splitSep with all empty elements removed. Python's split is splitSep. Perl's and Ruby's are neither of these, but I think that removing *all* trailing empty elements is not useful. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

At 10:44 28/07/04 +0100, Keith Wansbrough wrote:
{- | Split the list at the occurrences of a separator into sub-list. This is a generalization of 'words'. -}
There should be two versions of this, one which treats repeated delimiters the same as single ones and ignores initial and final delimiters, and one which treats repeated delimiters as creating an empty field. Thus
split isSpace " foo bar baz " = ["foo","bar","baz"] split' isSpace " foo bar baz " = ["","foo","","bar","baz",""]
I'm not sure if I support your exact conclusion, but I think this raises an important consideration about exactly what split should be. I just scanned through my codebase for similar functions, and found two versions: [[ splitBy :: (a->Bool) -> [a] -> [[a]] splitBy _ [] = [] splitBy p (s0:str) = let (s1,sr) = break p str in (s0:s1):splitBy p sr t1 = splitBy (`elem` "/\\") "//foo\\bar/baz" == ["/","/foo","\\bar","/baz"] ]] The separator is *not* removed and [[ breakAll :: (a -> Bool) -> [a] -> [[a]] breakAll _ [] = [] breakAll p s = let (h,s') = break p s in h : breakAll p (drop 1 s') t2 = breakAll (==',') "ab,cde,f,g,hij,,kl,," == ["ab","cde","f","g","hij","","kl",""] ]] Where the separator is removed. I had requirements for both forms, but if it's not clear that one form is superior I'd be less inclined to try and "standardize" it at all. And all this makes me realize that there are several other candidates such as additional variations suggested by Keith. #g ------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

[split, chop, all that]
How about biting the bullet and providing a real "tokenizer"? I have had the problem of having to split a text into lines, for instance, which used \r\n as EOL marker, not just \n. So I couldn't use 'lines'. Judging by the 'split' (or 'chop') proposals I've seen so far, I wouldn't be able to use them for that purpose either, because they don't support multi-byte tokens. Shooting from the hip, I'd say this more general function would do the trick: tokenize :: (a -> Bool) -> (a -> Bool) -> [a] -> [[a]] The first function returns 'True' if the the current input element is part of a valid token. The second function (the "skipper") would return 'True' if the current element is ignorable "whitespace". The input "foo bar \t claus \r\n stuff", for instance, could be tokenized into ["foo", "bar", "claus", "stuff"] by something along the lines of the following function call: tokenize isAlphaNum isSpace "input string" Basically, the 'tokenize' function would consume input until the first function says "False". Then it would consume (and drop) input until the second function says "False". And so on, until the end of input string is reached. One would have to think about what 'tokenize' would do if _both_ functions say 'False' for an input element, but IMHO that could just be an 'error'. I think that would be a nice addition to the standard library, and 'split' (or 'chop') would simply be specialized versions of this one. Peter

W liście z śro, 28-07-2004, godz. 18:09 +0200, Peter Simons napisał:
How about biting the bullet and providing a real "tokenizer"?
You can try using Parsec or another similar library for more complex tasks.
tokenize :: (a -> Bool) -> (a -> Bool) -> [a] -> [[a]]
The first function returns 'True' if the the current input element is part of a valid token. The second function (the "skipper") would return 'True' if the current element is ignorable "whitespace".
What if both return True, or if neither returns True?
Basically, the 'tokenize' function would consume input until the first function says "False". Then it would consume (and drop) input until the second function says "False". And so on, until the end of input string is reached.
This is my SplitSeps (formerly Split1), assuming the second function is just the negation of the first one. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk writes:
You can try using Parsec or another similar library for more complex tasks.
IMHO, Parsec is too complicated for something that simply. Stick to the example of implementing a 'lines' that uses \r\n instead of \n, if you will: I don't want any monads, I don't want error messages, I don't need any recursive descent parser. All I need is a simple tokenizer. In the past I have used 'lines' and then chopped off the last character off each line. Fair enough, but I think that there is a more general problem underneath that deserves to be solved. I see this function underneath Parsec. I would, under the right circumstances, use Parsec to write the functions I give to 'tokenize' as arguments! But I wouldn't write 'tokenize' in Parsec.
What if both return True, or if neither returns True?
That's an 'error'.
This is my SplitSeps (formerly Split1), assuming the second function is just the negation of the first one.
Great! Then I second that your function is added to the standard library. :-) Peter

W liście z śro, 28-07-2004, godz. 18:59 +0200, Peter Simons napisał:
IMHO, Parsec is too complicated for something that simply.
It is, but I don't have anything else.
Stick to the example of implementing a 'lines' that uses \r\n instead of \n, if you will: I don't want any monads, I don't want error messages, I don't need any recursive descent parser.
let splitLines str = result where Right result = parse (anyChar `manyTill` sep `manyTill` eof) "" str sep = (char '\n' >> return ()) <|> (try (string "\r\n") >> return ()) <|> eof BTW, IMHO the 'string' function should use 'try' automatically. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Peter Simons
Shooting from the hip, I'd say this more general function would do the trick:
tokenize :: (a -> Bool) -> (a -> Bool) -> [a] -> [[a]]
And of course, I needed to split based on a function of the current suffix, not individual characters. breakWhere :: ([a] -> Bool) -> [a] -> [[a]] Plenty of variations for this one, it seems. -kzm -- If I haven't seen further, it is by standing in the footprints of giants
participants (8)
-
Christian Maeder
-
dons@cse.unsw.edu.au
-
Graham Klyne
-
Keith Wansbrough
-
Ketil Malde
-
Marcin 'Qrczak' Kowalczyk
-
Peter Simons
-
Simon Marlow