substring search api

I want to start a discussion about a string searching api, especially with respect to ByteStrings. I'm concerned that we're about to release a bytestring package along with ghc-6.8 that will freeze a less than ideal string search api in stone. We have an opportunity to make that better since at the moment I think there are few serious users of that api since the current implementation is very slow. However in this release cycle we're going to get optimised string searching algorithms and then I expect the number of users to pick up. I'd be a shame if that forces us to keep a not very nice api. That said, here is the current api. -- | Find the indexes of all (possibly overlapping) occurances of a -- substring in a string. findSubstrings :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> [Int] -- | Get the first index of a substring in another string, -- or 'Nothing' if the string is not found. findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> Maybe Int findSubstring p s = listToMaybe (findSubstrings p s) -- | Check whether one string is a substring of another. isSubstringOf :: ByteString -- ^ String to search for. -> ByteString -- ^ String to search in. -> Bool isSubstringOf p s = not (null (findSubstrings p s)) We only have it implemented for strict ByteStrings at the moment. This api isn't too bad for the strict ByteString implementation since we can use indexes to select substrings in O(1) time. However for lazy bytestrings index are much less good since we have to traverse the list of chunks to use the indexes. A more compelling argument is that it's not very Haskelly to use indexes. We normally deal in spans rather than offsets and lengths. One possible alternative that I think is more Haskelly and would suit the strict and lazy (and indeed ordinary list []) implementation is: findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> (ByteString, -- span before search string ByteString) -- span beginning search string, or empty compare this to our standard List.break function.
break isSpace "foo bar" = ("foo", " bar") break isSpace "foobar" = ("foo bar", "")
So for findSubstring we'd have:
findSubstring "bar" "foo bar baz" = ("foo ","bar baz") findSubstring "notfound" "foo bar baz" = ("foo bar baz", "")
I quite like this as the primitive. isSubstringOf is easily definable in terms of that. I should note that there is really no performance penalty for working in terms of spans rather than indexes. Indeed for the lazy case it'd be faster. findSubstrings is much less clear. What's not clear is what the result type and content should be. The current impl that gives us a list of indexes of possibly-overlapping occurrences seems straightforward. What is the break/span style equivalent? A couple possibilities: findSubstrings :: ByteString -> ByteString -> [ByteString] where each result begins an occurrence and includes the trailing text before the next occurrence. Of course that means we loose any initial span. So we could have: findSubstrings :: ByteString -> ByteString -> (ByteString, [ByteString]) that includes the initial span. Then there's the issue of overlapping spans or not. Overlapping seems more general but also seems confusing when done in terms of spans. Or we could just side-step the issue and not provide a findSubstrings at all and let people implement the behavior they want in terms of findSubstring. Afterall there is no 'many' tokenising variant of List.span or List.break, probably because of the wide variation in requirements. I'd very much appreciate peoples opinions on this. Though bear in mind that I'm not looking for a large string searching framework, I want to pick a supportable primitive for strict and lazy bytestring searching that can go into the bytestring library now so we can avoid freezing a worse api. Duncan

In message <20070918143444.C08F894047@webmail220.herald.ox.ac.uk> Duncan Coutts
I want to start a discussion about a string searching api, especially with respect to ByteStrings. compare this to our standard List.break function.
break isSpace "foo bar" = ("foo", " bar") break isSpace "foobar" = ("foo bar", "")
Oops, copy'n'pasto
break isSpace "foobar" = ("foobar", "")
obviously.
findSubstrings :: ByteString -> ByteString -> [ByteString]
where each result begins an occurrence and includes the trailing text before the next occurrence.
It has been pointed out to me that there are two styles here that correspond naturally to overlapping or non-overlapping searches. If we want to look for all occurrences, even occurrences that overlap then it makes sense for each result to include the whole tail of the string, not just up to the next occurrence. eg:
findSubstrings "foo" "blah foo bar foo baz" = ["foo bar foo baz", "foo baz"]
then it's clear we do not need an initial span since that's just the whole string. If we want to split into non-overlapping spans then it makes more sense to provide the initial span too:
findSubstrings "foo" "blah foo bar foo baz" = ("blah ", ["foo bar ", "foo baz"])
Tagsoup makes this kind of distinction: http://www.cs.york.ac.uk/fp/haddock/tagsoup/Text-HTML-TagSoup.html#v%3Aparti... I have to say, I'm rather inclined to not prejudge the issue and not provide any findSubstrings function at the moment and wait and see what the common patterns are and if any are worth putting in the lib. So perhaps that's my straw-man proposal: * change BS.findSubstring to be :: BS -> BS -> (BS, BS) in the style of List.break * remove the current BS.findSubstrings and of course to also add findSubstring with the equivalent type for the ByteString.Lazy module. Duncan

Duncan Coutts wrote:
So perhaps that's my straw-man proposal: * change BS.findSubstring to be :: BS -> BS -> (BS, BS) in the style of List.break * remove the current BS.findSubstrings
While List.break is useful, it has the equally useful counterpart (dropWhile . not . (==)) that doesn't accumulate the prefix of a match. For a long sequence, this has appeal. Let's say you're reading ten gigabytes of data over the network, so you have no control over the incoming chunk size (as we don't provide a rechunking mechanism at present, so this isn't a hypothetical issue). A findSubstring that accumulates the prefix could easily cons an fatally large number of chunks. I'm not saying that the signature you suggest shouldn't be present, merely that it's not enough: it wants a counterpart that accumulates either nothing or something safe like an Int64 that counts the length of the prefix.

Bryan O'Sullivan
Duncan Coutts wrote:
So perhaps that's my straw-man proposal: * change BS.findSubstring to be :: BS -> BS -> (BS, BS) in the style of List.break * remove the current BS.findSubstrings
While List.break is useful, it has the equally useful counterpart (dropWhile . not . (==)) that doesn't accumulate the prefix of a match. For a long sequence, this has appeal. Let's say you're reading ten gigabytes of data over the network, so you have no control over the incoming chunk size (as we don't provide a rechunking mechanism at present, so this isn't a hypothetical issue). A findSubstring that accumulates the prefix could easily cons an fatally large number of chunks.
I'm not saying that the signature you suggest shouldn't be present, merely that it's not enough: it wants a counterpart that accumulates either nothing or something safe like an Int64 that counts the length of the prefix.
I'm not familiar with Bytestrings so I'm probably out of my depth, but something that strikes me is that if you are returned an index to a large object like this, to use it as the offset it would have to be the offset from the beginning of the large object, which would cause the large object to be held in memory until the indexing/dropping expression is evaluated. Or is there some more sophisticated form of indexing for byte strings? -- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

Jon Fairbairn wrote:
I'm not familiar with Bytestrings so I'm probably out of my depth, but something that strikes me is that if you are returned an index to a large object like this, to use it as the offset it would have to be the offset from the beginning of the large object, which would cause the large object to be held in memory until the indexing/dropping expression is evaluated.
That would be the case if you intended to use the offset for subsequent operations on the prefix, but at least sometimes that's not what you need it for. For example, if you want to report a parse error due to a truncated input stream, having the offset lets you say precisely where the error occurred, even though you won't be using the prefix for anything. This use of an offset is allowed for in the machinery of Data.Binary, and I use it in some libraries on top.

My opinions on this are... Duncan Coutts wrote:
I want to start a discussion about a string searching api, especially with respect to ByteStrings.
I'm concerned that we're about to release a bytestring package along with ghc-6.8 that will freeze a less than ideal string search api in stone. We have an opportunity to make that better since at the moment I think there are few serious users of that api since the current implementation is very slow. However in this release cycle we're going to get optimised string searching algorithms and then I expect the number of users to pick up. I'd be a shame if that forces us to keep a not very nice api.
I think 'nice' needs to be pinned down. There are the semantic properties we need: (I) Must not prevent use of optimized algorithms (II) Must not force whole lazy string into memory (unless required) (III) Must not force holding onto start of lazy string Note that these are only an issue with ByteString.Lazy And the API characteristics: obvious/intuitive/consistent/easy to learn/powerful/... (pick some).
That said, here is the current api.
-- | Find the indexes of all (possibly overlapping) occurances of a -- substring in a string. findSubstrings :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> [Int]
This style is often what is desired. It should be offered.
-- | Get the first index of a substring in another string, -- or 'Nothing' if the string is not found. findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> Maybe Int findSubstring p s = listToMaybe (findSubstrings p s)
-- | Check whether one string is a substring of another. isSubstringOf :: ByteString -- ^ String to search for. -> ByteString -- ^ String to search in. -> Bool isSubstringOf p s = not (null (findSubstrings p s))
Those two are convenience wrappers.
We only have it implemented for strict ByteStrings at the moment. This api isn't too bad for the strict ByteString implementation since we can use indexes to select substrings in O(1) time. However for lazy bytestrings index are much less good since we have to traverse the list of chunks to use the indexes.
A more compelling argument is that it's not very Haskelly to use indexes. We normally deal in spans rather than offsets and lengths.
Users may very well need the offsets, though.
One possible alternative that I think is more Haskelly and would suit the strict and lazy (and indeed ordinary list []) implementation is:
findSubstring :: ByteString -- ^ String to search for. -> ByteString -- ^ String to seach in. -> (ByteString, -- span before search string ByteString) -- span beginning search string, or empty
That means one needs to call some "length" function to get the offsets. This requires forcing the whole input string to get the length of the second found location. This is _not_ an alternative to [Int]
compare this to our standard List.break function.
break isSpace "foo bar" = ("foo", " bar") break isSpace "foobar" = ("foo bar", "")
So for findSubstring we'd have:
findSubstring "bar" "foo bar baz" = ("foo ","bar baz") findSubstring "notfound" "foo bar baz" = ("foo bar baz", "")
I quite like this as the primitive.
isSubstringOf is easily definable in terms of that.
I should note that there is really no performance penalty for working in terms of spans rather than indexes. Indeed for the lazy case it'd be faster.
For the single return of findSubstring, I agree the performance difference is small.
findSubstrings is much less clear. What's not clear is what the result type and content should be. The current impl that gives us a list of indexes of possibly-overlapping occurrences seems straightforward. What is the break/span style equivalent?
A couple possibilities:
findSubstrings :: ByteString -> ByteString -> [ByteString]
where each result begins an occurrence and includes the trailing text before the next occurrence. Of course that means we loose any initial span. So we could have:
findSubstrings :: ByteString -> ByteString -> (ByteString, [ByteString])
that includes the initial span.
Then there's the issue of overlapping spans or not. Overlapping seems more general but also seems confusing when done in terms of spans.
Or we could just side-step the issue and not provide a findSubstrings at all and let people implement the behavior they want in terms of findSubstring. Afterall there is no 'many' tokenising variant of List.span or List.break, probably because of the wide variation in requirements.
This violates the (I) property: Good search algorithms often work better finding all the hits for "findSubstrings" and work worse when built from the "findSubstring" primitive.
I'd very much appreciate peoples opinions on this. Though bear in mind that I'm not looking for a large string searching framework, I want to pick a supportable primitive for strict and lazy bytestring searching that can go into the bytestring library now so we can avoid freezing a worse api.
Duncan
The searching API of regex-base could be a model. It could even be re-used, as long as your API allows one to make RegexLike instances. This is just a slightly degenerate case of regex searching, after all. The Regular Expression analogy is a good one. Note that the KMP and Boyer-Moore algorithms pre-process the searched-for string much like a regular expression is compiled. This work can be cached by "saving" the one-argument curried version of the findSubstrings function when the same searched-for string is sought in several searched-in strings. (Imagine looking for the target string in a series of files). Thus KMP and Boyer-Moore could just be different back-ends for regex-base. But since finding literal string is a simpler case, it may well be that a simpler API is called for, with simpler type classes. Cheers, Chris Kuklewicz

haskell@list.mightyreason.com wrote:
Users may very well need the offsets, though. That means one needs to call some "length" function to get the offsets.
'length' of a ByteString is extremely cheap.
This requires forcing the whole input string to get the length of the second found location.
I don't see why it should force the whole input: only as much as is needed to find the desired substring. There is no need to call 'length' on the tail, since it doesn't yield any interesting information. Regards, Malcolm

On 9/18/07, Duncan Coutts
I want to start a discussion about a string searching api, especially with respect to ByteStrings.
My use case for searching bytestrings was to determine where a particular substring occurred in a search string. Thus, I needed the index. Taking the length of a string returned in a tuple would give me that index, but would feel dirty. I've seen DonS post enough to NOT take "length" that I wouldn't have thought to do that. I also wonder if the "tupling" style would make it easier to produce space leaks, especially if the "overlapping" behavior caused the tail of each string to be included. I don't quite understand sharing and the bytestring implementation to know how much of a concern that would be. I have to agree with Chris that I prefer findSubstrings to return a list of Int/Int64s, rather than the tuples. Justin

Justin Bailey wrote:
My use case for searching bytestrings was to determine where a particular substring occurred in a search string. Thus, I needed the index.
but what for? In a real application, I guess you want to do some further processing, so the index is just some intermediate result (you will acccess the string via the index later). The point now is to remove this intermediate altogether. (Cf. we don't have pointers - that's what distinguishes Haskell from C). Yes, space leaks might be a concern, but dangling (int) pointers are a concern as well: a concern for readability and safeness: if you have some Int, how do you know which string it points to, or whether it is a pointer or an offset to a pointer etc. (whenever I see an int/Int, I sense the "primitive obsession" smell ...) Best regards, Johannes Waldmann.

On 9/18/07, Johannes Waldmann
but what for? In a real application, I guess you want to do some further processing, so the index is just some intermediate result (you will acccess the string via the index later).
Actually, sometimes I needed the string from the beginning of the search to the match point (in which case the tuple would do what I wanted). In other cases, I needed to know the length of the matched string. In that case, knowing the index of the match made it very fast to calculate the length without looking at the string. While the length of a strict bytestring is very cheap (O(1)), the length of a lazy bytestring is not. Being able to do the above optimization was important to my application. Justin
participants (7)
-
Bryan O'Sullivan
-
Duncan Coutts
-
haskell@list.mightyreason.com
-
Johannes Waldmann
-
Jon Fairbairn
-
Justin Bailey
-
Malcolm Wallace