substring search api

18 Sep 2007

      I want to start a discussion about a string searching api, especially with
respect to ByteStrings.

I'm concerned that we're about to release a bytestring package along with
ghc-6.8 that will freeze a less than ideal string search api in stone. We have
an opportunity to make that better since at the moment I think there are few
serious users of that api since the current implementation is very slow. However
in this release cycle we're going to get optimised string searching algorithms
and then I expect the number of users to pick up. I'd be a shame if that forces
us to keep a not very nice api.

That said, here is the current api.

-- | Find the indexes of all (possibly overlapping) occurances of a
-- substring in a string.
findSubstrings :: ByteString -- ^ String to search for.
               -> ByteString -- ^ String to seach in.
               -> [Int]

-- | Get the first index of a substring in another string,
--   or 'Nothing' if the string is not found.
findSubstring :: ByteString -- ^ String to search for.
              -> ByteString -- ^ String to seach in.
              -> Maybe Int
findSubstring p s = listToMaybe (findSubstrings p s)

-- | Check whether one string is a substring of another.
isSubstringOf :: ByteString -- ^ String to search for.
              -> ByteString -- ^ String to search in.
              -> Bool
isSubstringOf p s = not (null (findSubstrings p s))

We only have it implemented for strict ByteStrings at the moment. This api isn't
too bad for the strict ByteString implementation since we can use indexes to
select substrings in O(1) time. However for lazy bytestrings index are much less
good since we have to traverse the list of chunks to use the indexes.

A more compelling argument is that it's not very Haskelly to use indexes. We
normally deal in spans rather than offsets and lengths.

One possible alternative that I think is more Haskelly and would suit the strict
and lazy (and indeed ordinary list []) implementation is:

findSubstring :: ByteString -- ^ String to search for.
              -> ByteString -- ^ String to seach in.
              -> (ByteString,  -- span before search string
                  ByteString)  -- span beginning search string, or empty

compare this to our standard List.break function.
...
break isSpace "foo bar" = ("foo", " bar")
break isSpace "foobar"  = ("foo bar", "")
So for findSubstring we'd have:
...
findSubstring "bar" "foo bar baz"      = ("foo ","bar baz")
findSubstring "notfound" "foo bar baz" = ("foo bar baz", "")
I quite like this as the primitive.

isSubstringOf is easily definable in terms of that.

I should note that there is really no performance penalty for working in terms
of spans rather than indexes. Indeed for the lazy case it'd be faster.

findSubstrings is much less clear. What's not clear is what the result type and
content should be. The current impl that gives us a list of indexes of
possibly-overlapping occurrences seems straightforward. What is the break/span
style equivalent?

A couple possibilities:

findSubstrings :: ByteString -> ByteString -> [ByteString]

where each result begins an occurrence and includes the trailing text before the
next occurrence. Of course that means we loose any initial span. So we could have:

findSubstrings :: ByteString -> ByteString -> (ByteString, [ByteString])

that includes the initial span.

Then there's the issue of overlapping spans or not. Overlapping seems more
general but also seems confusing when done in terms of spans.

Or we could just side-step the issue and not provide a findSubstrings at all and
let people implement the behavior they want in terms of findSubstring. Afterall
there is no 'many' tokenising variant of List.span or List.break, probably
because of the wide variation in requirements.

I'd very much appreciate peoples opinions on this. Though bear in mind that I'm
not looking for a large string searching framework, I want to pick a supportable
primitive for strict and lazy bytestring searching that can go into the
bytestring library now so we can avoid freezing a worse api.

Duncan

substring search api

Duncan Coutts