
Hi, I'm writing a Haskell program which is looking for whether or not a string (needle) exists within a larger string (haystack) given a Levenshtein distance. Note that it shouldn't calculate the Levenshtein distance between the needle and haystack. For example, needle: AAA haystack: TTTTTAATTTTT The needle would be "found" if the Levenshtein distance is set at 1, because dist("AAT", "AAA") == 1. The following almost works: distance(needle, haystack) - (len(haystack) - len(needle)) But it does not handle deletions correctly. I've previously written a program which does this approximate search in Java with the Wu-Manber extension of the Bitap, or shift-or, algorithm. The same strategy seems difficult to code up in Haskell because it's very "stateful" and involves a lot of bit-fiddling. (Maybe it's actually quite simple, but I'm not sure how I would go about doing this.) As a further complication, I'd prefer to keep the data packed as ByteStrings, as I'll be dealing with around 200 GiBs of data (split and parallelized over a cluster, but still a lot). I don't really know how to deal with ByteStrings without using the functions that come along in the ByteString module or unpacking the ByteString. I've been using the language-spelling package, which has a module which can calculate the Levenshtein distance between two ByteStrings. I see that it uses ListLike. I'm not really sure how it works, but I assume it makes ByteString an instance of ListLike, and then you have access to all of the ListLike methods? Does anyone have any advice on how I should proceed with this? I don't mind learning new things, but I don't really know what the best strategy is or where to look. -- Mason

Hi, The ListLike instance for ByteString is defined here: https://hackage.haskell.org/package/ListLike-3.1.7.1/docs/src/Data-ListLike-... The implementation of Levenshtein distance in language-spelling is here: https://hackage.haskell.org/package/language-spelling-0.3.2/docs/src/Languag... It seems to use only O(1) operations on the Bytestring: length and index (!!). It uses an unboxed mutable vector in the ST monad. It is specialized for ByteString. So I think it should be quite efficient already. If you want you can use the same approach to implement the other algorithm, with Data.Bits for the bit-fiddling: https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Bits.html Cheers, Sylvain On 27/04/2016 01:17, Mason Lai wrote:
Hi,
I'm writing a Haskell program which is looking for whether or not a string (needle) exists within a larger string (haystack) given a Levenshtein distance. Note that it shouldn't calculate the Levenshtein distance between the needle and haystack. For example,
needle: AAA haystack: TTTTTAATTTTT
The needle would be "found" if the Levenshtein distance is set at 1, because dist("AAT", "AAA") == 1.
The following almost works:
distance(needle, haystack) - (len(haystack) - len(needle))
But it does not handle deletions correctly.
I've previously written a program which does this approximate search in Java with the Wu-Manber extension of the Bitap, or shift-or, algorithm. The same strategy seems difficult to code up in Haskell because it's very "stateful" and involves a lot of bit-fiddling. (Maybe it's actually quite simple, but I'm not sure how I would go about doing this.)
As a further complication, I'd prefer to keep the data packed as ByteStrings, as I'll be dealing with around 200 GiBs of data (split and parallelized over a cluster, but still a lot). I don't really know how to deal with ByteStrings without using the functions that come along in the ByteString module or unpacking the ByteString.
I've been using the language-spelling package, which has a module which can calculate the Levenshtein distance between two ByteStrings. I see that it uses ListLike. I'm not really sure how it works, but I assume it makes ByteString an instance of ListLike, and then you have access to all of the ListLike methods?
Does anyone have any advice on how I should proceed with this? I don't mind learning new things, but I don't really know what the best strategy is or where to look.
-- Mason
_______________________________________________ Beginners mailing list Beginners@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners
participants (2)
-
Mason Lai
-
Sylvain Henry