ByteString search code available in easy-to-digest form

I've packaged up the fast Boyer-Moore and Knuth-Morris-Pratt code that Chris Kuklewicz posted a few months ago: http://article.gmane.org/gmane.comp.lang.haskell.libraries/7363 The consensus at the time was that the code was not ready for rolling into the bytestring package, but now it's easy to install and start working with. API docs: http://darcs.serpentine.com/stringsearch/dist/doc/html/stringsearch/ Patches against the darcs repo welcome: darcs get http://darcs.serpentine.com/stringsearch Credit to Justin Bailey, Daniel Fischer, and Chris Kuklewicz for their hard work. (Currently only tested against GHC 6.6.1, FYI.)

bos:
I've packaged up the fast Boyer-Moore and Knuth-Morris-Pratt code that Chris Kuklewicz posted a few months ago:
http://article.gmane.org/gmane.comp.lang.haskell.libraries/7363
The consensus at the time was that the code was not ready for rolling into the bytestring package, but now it's easy to install and start working with.
API docs:
http://darcs.serpentine.com/stringsearch/dist/doc/html/stringsearch/
Patches against the darcs repo welcome:
darcs get http://darcs.serpentine.com/stringsearch
Credit to Justin Bailey, Daniel Fischer, and Chris Kuklewicz for their hard work.
(Currently only tested against GHC 6.6.1, FYI.)
Do we have any benchmarks, for say, 1G files, versus linear, naive (strict) search? -- Don

On Nov 7, 2007 2:21 PM, Bryan O'Sullivan
Chris mentioned that he did, but I haven't had time to write anything benchmarky yet.
I used the attached program to benchmark the various functions against "endo.dna"[1], a 7 MB file that came with this year's ICFP contest. It appends a pattern that occurs nowhere in the file to the end of that file and then searches for it. Strict and lazy bytestring searches using KMP are performed, plus a search using the existing bytestring searches and using a List. You'll have to change the import from Data.ByteString.KMP for it to compile but otherwise it should work out of the box.. Justin [1] http://www.icfpcontest.org/endo.zip

Bryan O'Sullivan wrote:
Patches against the darcs repo welcome:
darcs get http://darcs.serpentine.com/stringsearch
Credit to Justin Bailey, Daniel Fischer, and Chris Kuklewicz for their hard work.
(Currently only tested against GHC 6.6.1, FYI.)
Does not compile with GHC 6.8.1:
runhaskell Setup.lhs build Preprocessing library stringsearch-0.1.1... Building stringsearch-0.1.1...
Data/ByteString/Search/BoyerMoore.hs:59:7: Could not find module `Data.Array.Unboxed': it is a member of package array-0.1.0.0, which is hidden

Yeah, my code wants to open up the internals of Lazy bytestrings. Until recently this was possible toChunks, but it would be best to rewrite it for the newest Lazy representation (which comes with new shiny ghc 6.8.1). It is a trivial change, but I due to ghc-6.8.1 failing on ppc G4 OS X, I won't be updating it myself at the moment. -- Chris Niko Korhonen wrote:
Bryan O'Sullivan wrote:
Patches against the darcs repo welcome:
darcs get http://darcs.serpentine.com/stringsearch
Credit to Justin Bailey, Daniel Fischer, and Chris Kuklewicz for their hard work.
(Currently only tested against GHC 6.6.1, FYI.)
Does not compile with GHC 6.8.1:
runhaskell Setup.lhs build Preprocessing library stringsearch-0.1.1... Building stringsearch-0.1.1...
Data/ByteString/Search/BoyerMoore.hs:59:7: Could not find module `Data.Array.Unboxed': it is a member of package array-0.1.0.0, which is hidden

ChrisK wrote:
Yeah, my code wants to open up the internals of Lazy bytestrings. Until recently this was possible toChunks, but it would be best to rewrite it for the newest Lazy representation (which comes with new shiny ghc 6.8.1).
I've updated the stringsearch package on hackage so that it ought to compile against both 6.6.1 and 6.8.1. The toChunks function hasn't changed in any way.

bos:
ChrisK wrote:
Yeah, my code wants to open up the internals of Lazy bytestrings. Until recently this was possible toChunks, but it would be best to rewrite it for the newest Lazy representation (which comes with new shiny ghc 6.8.1).
I've updated the stringsearch package on hackage so that it ought to compile against both 6.6.1 and 6.8.1. The toChunks function hasn't changed in any way.
And is the representation-indepedent api anyway, so should be preferred.
participants (5)
-
Bryan O'Sullivan
-
ChrisK
-
Don Stewart
-
Justin Bailey
-
Niko Korhonen