
On Friday 13 August 2010 19:53:37, Bryan O'Sullivan wrote:
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
wrote: That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests.
Interesting. Got a test case so I can repro and fix? :-)
Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to test on larger files. ByteString code (bmLazy.hs): ---------------------------------------------------------------- {-# LANGUAGE BangPatterns #-} module Main (main) where import System.Environment (getArgs) import qualified Data.ByteString.Char8 as C import qualified Data.ByteString.Lazy as L import Data.ByteString.Lazy.Search main :: IO () main = do (file : pat : _) <- getArgs let !spat = C.pack pat work = indices spat L.readFile file >>= print . length . work ---------------------------------------------------------------- Data.Text.Lazy (textLazy.hs): ---------------------------------------------------------------- {-# LANGUAGE BangPatterns #-} module Main (main) where import System.Environment (getArgs) import qualified Data.Text.Lazy as T import qualified Data.Text.Lazy.IO as TIO import Data.Text.Lazy.Search main :: IO () main = do (file : pat : _) <- getArgs let !spat = T.pack pat work = indices spat TIO.readFile file >>= print . length . work ---------------------------------------------------------------- (Data.Text.Lazy.Search is of course not exposed by default ;), I use text-0.7.2.1) Some local timings: 1. real words in a real text file: $ time ./textLazy big.txt the 92805 0.59user 0.00system 0:00.61elapsed 97%CPU $ time ./bmLazy big.txt the92805 0.02user 0.01system 0:00.04elapsed 104%CPU $ time ./textLazy big.txt and 43587 0.56user 0.01system 0:00.58elapsed 100%CPU $ time ./bmLazy big.txt and 43587 0.02user 0.01system 0:00.03elapsed 88%CPU $ time ./textLazy big.txt mother 317 0.44user 0.01system 0:00.46elapsed 99%CPU $ time ./bmLazy big.txt mother 317 0.00user 0.01system 0:00.02elapsed 69%CPU $ time ./textLazy big.txt deteriorate 2 0.37user 0.00system 0:00.38elapsed 98%CPU $ time ./bmLazy big.txt deteriorate 2 0.01user 0.01system 0:00.02elapsed 114%CPU $ time ./textLazy big.txt "Project Gutenberg" 177 0.37user 0.00system 0:00.38elapsed 97%CPU $ time ./bmLazy big.txt "Project Gutenberg" 177 0.00user 0.01system 0:00.01elapsed 100%CPU 2. periodic pattern in a file of 33.4M of aaaaa: $ time ./bmLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 34999942 1.22user 0.04system 0:01.30elapsed 97%CPU $ time ./textLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 593220 3.07user 0.03system 0:03.14elapsed 98%CPU Oh, that's closer, but text doesn't find overlapping matches, well, we can do that too (replace indices with nonOverlappingIndices): $ time ./noBMLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 593220 0.18user 0.04system 0:00.23elapsed 97%CPU Yeah, that's more like it :D