
I want to start a discussion about a string searching api, especially with respect to ByteStrings. I'm concerned that we're about to release a bytestring package along with ghc-6.8 that will freeze a less than ideal string search api in stone. We have an opportunity to make that better since at the moment I think there are few serious users of that api since the current implementation is very slow. However in this release cycle we're going to get optimised string searching algorithms and then I expect the number of users to pick up. I'd be a shame if that forces us to keep a not very nice api. That said, here is the current api. -- | Find the indexes of all (possibly overlapping) occurances of a -- substring in a string. findSubstrings :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> [Int] -- | Get the first index of a substring in another string, -- or 'Nothing' if the string is not found. findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> Maybe Int findSubstring p s = listToMaybe (findSubstrings p s) -- | Check whether one string is a substring of another. isSubstringOf :: ByteString -- ^ String to search for. -> ByteString -- ^ String to search in. -> Bool isSubstringOf p s = not (null (findSubstrings p s)) We only have it implemented for strict ByteStrings at the moment. This api isn't too bad for the strict ByteString implementation since we can use indexes to select substrings in O(1) time. However for lazy bytestrings index are much less good since we have to traverse the list of chunks to use the indexes. A more compelling argument is that it's not very Haskelly to use indexes. We normally deal in spans rather than offsets and lengths. One possible alternative that I think is more Haskelly and would suit the strict and lazy (and indeed ordinary list []) implementation is: findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> (ByteString, -- span before search string ByteString) -- span beginning search string, or empty compare this to our standard List.break function.
break isSpace "foo bar" = ("foo", " bar") break isSpace "foobar" = ("foo bar", "")
So for findSubstring we'd have:
findSubstring "bar" "foo bar baz" = ("foo ","bar baz") findSubstring "notfound" "foo bar baz" = ("foo bar baz", "")
I quite like this as the primitive. isSubstringOf is easily definable in terms of that. I should note that there is really no performance penalty for working in terms of spans rather than indexes. Indeed for the lazy case it'd be faster. findSubstrings is much less clear. What's not clear is what the result type and content should be. The current impl that gives us a list of indexes of possibly-overlapping occurrences seems straightforward. What is the break/span style equivalent? A couple possibilities: findSubstrings :: ByteString -> ByteString -> [ByteString] where each result begins an occurrence and includes the trailing text before the next occurrence. Of course that means we loose any initial span. So we could have: findSubstrings :: ByteString -> ByteString -> (ByteString, [ByteString]) that includes the initial span. Then there's the issue of overlapping spans or not. Overlapping seems more general but also seems confusing when done in terms of spans. Or we could just side-step the issue and not provide a findSubstrings at all and let people implement the behavior they want in terms of findSubstring. Afterall there is no 'many' tokenising variant of List.span or List.break, probably because of the wide variation in requirements. I'd very much appreciate peoples opinions on this. Though bear in mind that I'm not looking for a large string searching framework, I want to pick a supportable primitive for strict and lazy bytestring searching that can go into the bytestring library now so we can avoid freezing a worse api. Duncan