[Haskell-cafe] Re: Bytestrings and [Char]

23 Mar 2010

      Iustin Pop  writes:
...
On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote:
...
So that's a 30% reduction in throughput.  I'd say that's a lot worse
than a few percentage points, but certainly not orders of magnitude.
Because you're possibly benchmarking the disk also. With a 400MB
file on tmpfs, lazy bytestring readfile + length takes on my machine
~150ms, which is way faster than 8 seconds…
If you read the source code, length do not read the data, that's why
it is so fast. It cannot be done for UTF-8 strings.
...
From Data.ByteString.Lazy:
-- | /O(n\/c)/ 'length' returns the length of a ByteString as an
-- | 'Int64'
length :: ByteString -> Int64
length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs
{-# INLINE length #-}
...
...
On the other hand, using Data.ByteString.Lazy.readFile and
Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with
approximately 5% of that time spent in GC, which is rather worse than
the Prelude.  Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are
even worse, with results of around 25 *seconds* (!!) and 2% of that time
spent in GC.
GNU wc computes the correct answer as quickly as lazy bytestrings
compute the wrong answer.  With perl 5.8, slurping the entire file as
UTF-8 computes the correct answer just as slowly as Prelude.  In my
first ever Python program (with python 2.6), I tried to read the entire
file as a unicode string and it quickly crashes due to running out of
memory (yikes!), so it earns a DNF.
So, for computing the right answer with this simple test, it looks like
the Prelude is the best option.  We tie with Perl and lose only to GNU
wc (which is written in C).  Really, though, it would be nice to close
that gap.
Totally agreed :)
texitoi@flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt
Normal String + System.IO "60452700": 5.575169s
Data.ByteString.Lazy "61965200": 0.088136s
Data.ByteString.Lazy.UTF8 "60452700": 13.953714s
Cheating a little bit "60452700": 9.307322s
Data.Text.Lazy "60452700": 15.608354s
texitoi@flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt 
 1329900  8065200 61965200 /dev/shm/haskell-utf8.txt

real    0m9.303s
user    0m9.089s
sys     0m0.152s
texitoi@flyeeeng:~$ 

Hey, normal string way faster than GNU wc!

Cheat sheet, using Data.ByteString.Lazy:

myLength :: U.ByteString -> Int
myLength b = loop 0 b
  where loop n xs = case readChar xs of
                      Just m ->
                          let n' = n+1
                          in n' `seq` loop n' (L.drop m xs)
                      Nothing -> n

readChar :: L.ByteString -> Maybe Int64
readChar bs = do (c,_) <- L.uncons bs
                 return (choose (fromEnum c))
  where
  choose :: Int -> Int64
  choose c
    | c < 0xc0  = 1
    | c < 0xe0  = 2
    | c < 0xf0  = 3
    | c < 0xf8  = 4
    | otherwise = 1

inspired by Data.ByteString.Lazy.UTF8, same performances as GNU wc (it
is cheating because it do not check the validity of the multibyte char).

Using Debian testing, ghc 6.12.1 on Atom N270 @ 1.6GHz. The file is a
repeated LaTeX UTF8 file of about 60MB.

-- 
Guillaume Pinot               http://www.irccyn.ec-nantes.fr/~pinot/

« Les grandes personnes ne comprennent jamais rien toutes seules, et
c'est fatigant, pour les enfants, de toujours leur donner des
explications... » -- Antoine de Saint-Exupéry, Le Petit Prince

()  ASCII ribbon campaign      -- Against HTML e-mail
/\  http://www.asciiribbon.org -- Against proprietary attachments

[Haskell-cafe] Re: Bytestrings and [Char]

TeXitoi