
In message <20070918143444.C08F894047@webmail220.herald.ox.ac.uk> Duncan Coutts wrote:
I want to start a discussion about a string searching API, especially with respect to ByteStrings.
Compare this to our standard List.break function.
break isSpace "foo bar" = ("foo", " bar")
break isSpace "foobar" = ("foo bar", "")
Oops, copy'n'pasto
break isSpace "foobar" = ("foobar", "")
obviously.
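To make the comparison with break concrete, here is a rough sketch of a substring search in the same style. The names breakOn and indexOf are made up for illustration and are not part of any library; the scan is deliberately naive (it tries the pattern at every offset), so treat it as a statement of the intended semantics rather than a serious implementation.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical name, for illustration only.
-- Returns the text before the first occurrence of the pattern, and the
-- rest of the string starting at that occurrence. If the pattern does
-- not occur, the second component is empty, just as with List.break.
breakOn :: B.ByteString -> B.ByteString -> (B.ByteString, B.ByteString)
breakOn pat str = go 0
  where
    go i
      | pat `B.isPrefixOf` B.drop i str = B.splitAt i str
      | i >= B.length str               = (str, B.empty)
      | otherwise                       = go (i + 1)

-- Hypothetical helper: an index-returning search falls out of the
-- break-style result, so nothing is lost by returning the pair.
indexOf :: B.ByteString -> B.ByteString -> Maybe Int
indexOf pat str
  | B.null pat  = Just 0
  | B.null rest = Nothing
  | otherwise   = Just (B.length before)
  where
    (before, rest) = breakOn pat str
```

With this sketch, breakOn "bar" "foo bar baz" gives ("foo ", "bar baz"), and a pattern that does not occur gives the whole string paired with the empty string.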
findSubstrings :: ByteString -> ByteString -> [ByteString]
where each result begins at an occurrence of the pattern and includes the text that follows it, up to the next occurrence.
It has been pointed out to me that there are two styles here, corresponding naturally to overlapping and non-overlapping searches. If we want to look for all occurrences, even occurrences that overlap, then it makes sense for each result to include the whole tail of the string, not just the text up to the next occurrence, e.g.:
findSubstrings "foo" "blah foo bar foo baz" = ["foo bar foo baz", "foo baz"]
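A minimal sketch of this overlapping style, assuming we are happy with a naive scan over all tails. The name findOverlapping is hypothetical, and returning no results for an empty pattern is my own arbitrary choice, not something settled in this thread.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical name, for illustration only.
-- Every suffix of the input that starts with the pattern, in order.
-- Overlapping occurrences are all reported, and each result is the
-- whole remaining tail of the string.
findOverlapping :: B.ByteString -> B.ByteString -> [B.ByteString]
findOverlapping pat str
  | B.null pat = []  -- assumption: an empty pattern yields no results
  | otherwise  = filter (pat `B.isPrefixOf`) (B.tails str)
```

For the pattern "aa" in "aaa" this reports both "aaa" and "aa", which is the overlapping behaviour described above. Since each result is a slice of the original string, returning whole tails costs no copying.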
Then it's clear we do not need an initial span, since that's just the whole string. If we want to split into non-overlapping spans, then it makes more sense to provide the initial span too:
findSubstrings "foo" "blah foo bar foo baz" = ("blah ", ["foo bar ", "foo baz"])
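The non-overlapping style can be sketched along these lines. I am assuming a bytestring version that exports breakSubstring with the break-style type (ByteString -> ByteString -> (ByteString, ByteString)); splitOccurrences is a made-up name, and the handling of an empty pattern is again an arbitrary choice.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical name, for illustration only.
-- The initial span before the first occurrence, followed by one span
-- per non-overlapping occurrence. Each span starts with the pattern and
-- runs up to (but not including) the next occurrence.
splitOccurrences :: B.ByteString -> B.ByteString -> (B.ByteString, [B.ByteString])
splitOccurrences pat str
  | B.null pat = (str, [])  -- assumption: an empty pattern never matches
  | otherwise  = (before, go rest)
  where
    n = B.length pat
    (before, rest) = B.breakSubstring pat str
    -- Invariant: s is either empty or starts with the pattern.
    go s
      | B.null s  = []
      | otherwise =
          let (body, next) = B.breakSubstring pat (B.drop n s)
          in  B.take (n + B.length body) s : go next
```

One property of this shape is that the initial span and the result spans concatenate back to the original string, so it behaves like a split that keeps the delimiters.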
Tagsoup makes this kind of distinction:
http://www.cs.york.ac.uk/fp/haddock/tagsoup/Text-HTML-TagSoup.html#v%3Aparti...

I have to say, I'm rather inclined not to prejudge the issue: not provide any findSubstrings function at the moment, and wait and see what the common patterns are and whether any are worth putting in the lib.

So perhaps that's my straw-man proposal:

 * change BS.findSubstring to be :: BS -> BS -> (BS, BS) in the style of List.break
 * remove the current BS.findSubstrings

and of course to also add findSubstring with the equivalent type for the ByteString.Lazy module.

Duncan