
On Wed, 12 Jul 2006, Donald Bruce Stewart wrote:
I vote for this, currently implemented in Data.ByteString:

    -- | split on characters
    split :: Char -> String -> [String]

    -- | split on predicate *
    splitBy :: (Char -> Bool) -> String -> [String]

and

    -- | split on a string
    tokens :: String -> String -> [String]
OED on "token": 3b [Computing] The smallest meaningful unit of information in sequence of data for a compiler.

I think that's more or less what it means to me, too. It may be possible to come up with a name that is more likely to suggest what it does and less likely to collide with identifiers used elsewhere. Maybe "splits", but anyway ideally including "split".

Of course technically we seem to be talking about lists, but this last one is surely mostly about strings.
Question over whether it should be:

    splitBy (=='a') "aabbaca" == ["","","bb","c",""]

or

    splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
People will want both. The second form can be computed from the first, because it discards information about the input string, but for the same reason of course the first can't be derived from the second. (I'm not the first to say that, but since mail to this list has been arriving out of order, here it is again.)

In the convention I know, possibly coming from the world of UNIX shell tools, the default whitespace split is type 2, but a split on any other string is type 1. The UNIX shell does that, as do awk, Python ... (Perl is awk gone horribly wrong, so it presumably does too, but if it doesn't, it's the exception that proves the rule.) It has worked for a lot of people who do a lot of splitting, for a lot of years.

Donn Cave, donn@drizzle.com
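To illustrate the point that the second form can be computed from the first but not the other way round, here is a minimal sketch. The definitions are assumptions made for illustration only: this `splitBy` is written from scratch on plain lists and is not the Data.ByteString code, and the name `splitByNonEmpty` is hypothetical.

```haskell
-- Type 1: keep every field, including the empty ones produced by
-- leading, trailing, or adjacent separators.
-- (Illustrative definition, not the Data.ByteString implementation.)
splitBy :: (a -> Bool) -> [a] -> [[a]]
splitBy p xs = case break p xs of
  (field, [])       -> [field]                  -- no separator left
  (field, _ : rest) -> field : splitBy p rest   -- drop the separator, recurse

-- Type 2: derived from type 1 by discarding the empty fields.
-- (Hypothetical name; nothing in the thread proposes it.)
splitByNonEmpty :: (a -> Bool) -> [a] -> [[a]]
splitByNonEmpty p = filter (not . null) . splitBy p
```

With these definitions, `splitBy (=='a') "aabbaca"` gives `["","","bb","c",""]` and `splitByNonEmpty (=='a') "aabbaca"` gives `["bb","c"]`. The second is a one-line filter over the first, whereas nothing in `["bb","c"]` says where the empty fields were, so the first cannot be rebuilt from it.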