
Iustin Pop
On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude.
Because you're possibly benchmarking the disk also. With a 400MB file on tmpfs, lazy bytestring readfile + length takes on my machine ~150ms, which is way faster than 8 seconds…
If you read the source code, length do not read the data, that's why it is so fast. It cannot be done for UTF-8 strings.
From Data.ByteString.Lazy:
-- | /O(n\/c)/ 'length' returns the length of a ByteString as an -- | 'Int64' length :: ByteString -> Int64 length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs {-# INLINE length #-}
On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC.
GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as Prelude. In my first ever Python program (with python 2.6), I tried to read the entire file as a unicode string and it quickly crashes due to running out of memory (yikes!), so it earns a DNF.
So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap.
Totally agreed :)
texitoi@flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt Normal String + System.IO "60452700": 5.575169s Data.ByteString.Lazy "61965200": 0.088136s Data.ByteString.Lazy.UTF8 "60452700": 13.953714s Cheating a little bit "60452700": 9.307322s Data.Text.Lazy "60452700": 15.608354s texitoi@flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt 1329900 8065200 61965200 /dev/shm/haskell-utf8.txt real 0m9.303s user 0m9.089s sys 0m0.152s texitoi@flyeeeng:~$ Hey, normal string way faster than GNU wc! Cheat sheet, using Data.ByteString.Lazy: myLength :: U.ByteString -> Int myLength b = loop 0 b where loop n xs = case readChar xs of Just m -> let n' = n+1 in n' `seq` loop n' (L.drop m xs) Nothing -> n readChar :: L.ByteString -> Maybe Int64 readChar bs = do (c,_) <- L.uncons bs return (choose (fromEnum c)) where choose :: Int -> Int64 choose c | c < 0xc0 = 1 | c < 0xe0 = 2 | c < 0xf0 = 3 | c < 0xf8 = 4 | otherwise = 1 inspired by Data.ByteString.Lazy.UTF8, same performances as GNU wc (it is cheating because it do not check the validity of the multibyte char). Using Debian testing, ghc 6.12.1 on Atom N270 @ 1.6GHz. The file is a repeated LaTeX UTF8 file of about 60MB. -- Guillaume Pinot http://www.irccyn.ec-nantes.fr/~pinot/ « Les grandes personnes ne comprennent jamais rien toutes seules, et c'est fatigant, pour les enfants, de toujours leur donner des explications... » -- Antoine de Saint-Exupéry, Le Petit Prince () ASCII ribbon campaign -- Against HTML e-mail /\ http://www.asciiribbon.org -- Against proprietary attachments