
On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
On Sunday 11 April 2010 18:04:14, Maciej Piechotka wrote:
Of course:
- I haven't done any tests. I guessed (which I wrote).
I have just done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside the latin1 range), and the same file with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
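For reference, the appended variant could be produced roughly like this (a sketch, not necessarily the code actually used; the output name "big-appended.txt" is made up):

import System.IO

-- Hypothetical helper: copy big.txt and append code points 256..10000
-- so the result also contains characters outside the latin1 range.
main :: IO ()
main = do
  hIn <- openFile "big.txt" ReadMode
  hSetEncoding hIn latin1            -- the source stays within latin1
  s <- hGetContents hIn
  hOut <- openFile "big-appended.txt" WriteMode
  hSetEncoding hOut utf8             -- the appended code points need UTF-8
  hPutStr hOut (s ++ '\n' : map toEnum [256 .. 10000] ++ "\n")
  hClose hOut
  hClose hIn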
A converted myspell Polish dictionary (a few % of non-ASCII chars), added twice (6531616 bytes); this is the "dict" file used by the benchmark below.
Code:
main = A.readFile "big.txt" >>= print . B.length
{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)

-- String I/O through an explicitly UTF-8 handle
readFile :: String -> IO String
readFile p = do
  h <- openFile p ReadMode
  hSetEncoding h utf8
  hGetContents h

-- Time one action, forcing its result; GC first so runs interfere less
measure :: IO a -> IO NominalDiffTime
measure a = do
  performGC
  start <- getCurrentTime
  !_ <- a
  end <- getCurrentTime
  return $! end `diffUTCTime` start

-- Count spaces (fromEnum works for both Word8 and Char)
find !x v | fromEnum v == 32 = x + 1
          | otherwise        = x

-- Count 'ą' and 'Ą' (Char-based containers only)
find' !x 'ą' = x + 1
find' !x 'Ą' = x + 1
find' !x _   = x

main =
     printMeasure "Length - ByteString"               (S.length  <$> S.readFile  "dict")
  >> printMeasure "Length - Lazy ByteString"          (L.length  <$> L.readFile  "dict")
  >> printMeasure "Length - String"                   (length    <$> readFile    "dict")
  >> printMeasure "Length - UTF8 ByteString"          (SU.length <$> S.readFile  "dict")
  >> printMeasure "Length - UTF8 Lazy ByteString"     (LU.length <$> L.readFile  "dict")
  >> printMeasure "Length - Text"                     (T.length  <$> T.readFile  "dict")
  >> printMeasure "Length - Lazy Text"                (TL.length <$> TL.readFile "dict")
  >> printMeasure "Searching - ByteString"            (S.foldl'  find 0 <$> S.readFile  "dict")
  >> printMeasure "Searching - Lazy ByteString"       (L.foldl'  find 0 <$> L.readFile  "dict")
  >> printMeasure "Searching - String"                (foldl'    find 0 <$> readFile    "dict")
  >> printMeasure "Searching - UTF8 ByteString"       (SU.foldl  find 0 <$> S.readFile  "dict")
  >> printMeasure "Searching - UTF8 Lazy ByteString"  (LU.foldl  find 0 <$> L.readFile  "dict")
  >> printMeasure "Searching - Text"                  (T.foldl'  find 0 <$> T.readFile  "dict")
  >> printMeasure "Searching - Lazy Text"             (TL.foldl' find 0 <$> TL.readFile "dict")
  >> printMeasure "Searching ą - String"              (foldl'    find' 0 <$> readFile    "dict")
  >> printMeasure "Searching ą - UTF8 ByteString"     (SU.foldl  find' 0 <$> S.readFile  "dict")
  >> printMeasure "Searching ą - UTF8 Lazy ByteString" (LU.foldl find' 0 <$> L.readFile  "dict")
  >> printMeasure "Searching ą - Text"                (T.foldl'  find' 0 <$> T.readFile  "dict")
  >> printMeasure "Searching ą - Lazy Text"           (TL.foldl' find' 0 <$> TL.readFile "dict")

printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v ->
  printf "%-40s %8.5f s\n" (s ++ ":") (realToFrac v :: Float)
where (A,B) in the one-line test above is a suitable combination of the following (one concrete pairing is sketched after the list):
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude
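For concreteness, here is one such (A,B) pairing spelled out, a sketch using the strict Text combination (any of the listed module pairs works the same way):

import qualified Data.Text as T        -- B: Data.Text.length counts Chars
import qualified Data.Text.IO as TIO   -- A: Data.Text.IO.readFile

main :: IO ()
main = TIO.readFile "big.txt" >>= print . T.length

-- Swapping both imports for Data.ByteString gives the byte-counting
-- variant of the same one-liner.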
Times (for the one-line test):
Data.ByteString[.Lazy]:     0.00s
Data.ByteString.UTF8:       0.14s
Prelude:                    0.21s
Data.ByteString.Lazy.UTF8:  0.56s
Data.Text:                  0.66s
Optimized:

Length - ByteString:                    0.01223 s
Length - Lazy ByteString:               0.00328 s
Length - String:                        0.15474 s
Length - UTF8 ByteString:               0.19945 s
Length - UTF8 Lazy ByteString:          0.30123 s
Length - Text:                          0.70438 s
Length - Lazy Text:                     0.62137 s

String seems to be the fastest of the 'correct' (Unicode-aware) ones.

Searching - ByteString:                 0.04604 s
Searching - Lazy ByteString:            0.04424 s
Searching - String:                     0.18178 s
Searching - UTF8 ByteString:            0.32606 s
Searching - UTF8 Lazy ByteString:       0.42984 s
Searching - Text:                       0.26599 s
Searching - Lazy Text:                  0.37320 s

While ByteString is the clear winner, String actually does well compared to the others.

Searching ą - String:                   0.18557 s
Searching ą - UTF8 ByteString:          0.32752 s
Searching ą - UTF8 Lazy ByteString:     0.43811 s
Searching ą - Text:                     0.28401 s
Searching ą - Lazy Text:                0.37612 s

String is the fastest? Hmmm.

Compiled:

Length - ByteString:                    0.00861 s
Length - Lazy ByteString:               0.00409 s
Length - String:                        0.16059 s
Length - UTF8 ByteString:               0.20165 s
Length - UTF8 Lazy ByteString:          0.31885 s
Length - Text:                          0.70891 s
Length - Lazy Text:                     0.65553 s

ByteString is again the clear winner, but String once again wins in the 'correct' section.

Searching - ByteString:                 1.27414 s
Searching - Lazy ByteString:            1.27303 s
Searching - String:                     0.56831 s
Searching - UTF8 ByteString:            0.68742 s
Searching - UTF8 Lazy ByteString:       0.75883 s
Searching - Text:                       1.16121 s
Searching - Lazy Text:                  1.76678 s

I mean... what? I may be doing something wrong.

Searching ą - String:                   0.32612 s
Searching ą - UTF8 ByteString:          0.41564 s
Searching ą - UTF8 Lazy ByteString:     0.52919 s
Searching ą - Text:                     0.87463 s
Searching ą - Lazy Text:                1.52369 s

No comment.

Interpreted:

Length - ByteString:                    0.00511 s
Length - Lazy ByteString:               0.00378 s
Length - String:                        0.16657 s
Length - UTF8 ByteString:               0.21639 s
Length - UTF8 Lazy ByteString:          0.33952 s
Length - Text:                          0.79771 s
Length - Lazy Text:                     0.65320 s

As with the others.

Searching - ByteString:                 9.12051 s
Searching - Lazy ByteString:            8.94038 s
Searching - String:                     8.57391 s
Searching - UTF8 ByteString:            7.71766 s
Searching - UTF8 Lazy ByteString:       7.79422 s
Searching - Text:                       8.34435 s
Searching - Lazy Text:                  9.07538 s

Now they are pretty much equal.

Searching ą - String:                   3.17010 s
Searching ą - UTF8 ByteString:          3.94399 s
Searching ą - UTF8 Lazy ByteString:     3.92382 s
Searching ą - Text:                     3.32901 s
Searching ą - Lazy Text:                4.18038 s

Hmm. Still the best?

Your test (the one-liner above; times in seconds):

                        Optimized  Compiled  Interpreted
ByteString:                 0.011     0.011        0.421
ByteString Lazy:            0.006     0.006        0.535
String:                     0.237     0.240        0.650
Text:                       0.767     0.720        1.192
Text Lazy:                  0.661     0.614        1.061
ByteString UTF8:            0.204     0.204        0.631
ByteString Lazy UTF8:       0.386     0.309        0.744

System:
Core 2 Duo T9600 2.80 GHz, 2 GiB RAM
Gentoo Linux x86-64, Linux 2.6.33 + gentoo patches + ck, glibc 2.11
GHC 6.12.1, base 4.2.0.0, bytestring 0.9.1.5, text 0.7.1.0, utf8-string 0.3.6

PS. Tests were repeated a few times and each run gave similar results.
- It wasn't stated what the typical case is.
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. FASTA files, numerical data). For those, ByteString is usually by far the best choice.
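A minimal sketch of that kind of job, assuming a whitespace-separated file of ASCII integers (the name "numbers.txt" is made up):

import qualified Data.ByteString.Char8 as C
import Data.Maybe (mapMaybe)

-- Sum a whitespace-separated file of ASCII integers. Char8 is fine here
-- because the input is known to be ASCII/latin1, and readInt never goes
-- through String.
main :: IO ()
main = do
  s <- C.readFile "numbers.txt"
  print (sum (map fst (mapMaybe C.readInt (C.words s))))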
On the other hand, if you load numerical data it is likely that:
- it will have some labels, and those labels may well need non-ASCII or non-Latin characters;
- most of the time will be spent operating on the numbers rather than on the strings (see the sketch below).
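As a rough illustration of that split, the numeric bulk can stay in ByteString while only the labels are decoded (a sketch; the line format and the name "data.txt" are assumptions):

import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Maybe (mapMaybe)

-- Assumed line format: a label (possibly non-ASCII, UTF-8 encoded)
-- followed by whitespace-separated integers. Only the label is decoded
-- to Text; the numbers never leave ByteString.
parseLine :: C.ByteString -> (T.Text, [Int])
parseLine l = case C.words l of
  (lab:nums) -> (TE.decodeUtf8 lab, map fst (mapMaybe C.readInt nums))
  []         -> (T.empty, [])

main :: IO ()
main = do
  rows <- fmap (map parseLine . C.lines) (C.readFile "data.txt")
  print (sum (concatMap snd rows))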
Another fairly typical case is *text* processing, possibly with text in different scripts (Latin, Hebrew, Kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice; vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to UTF-8.
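A minimal illustration of why vanilla ByteStrings fall down on such text, assuming a UTF-8 encoded input (the name "sample.txt" is made up):

import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8 treats every byte as a separate latin1 Char, so the two-byte
-- UTF-8 sequence for 'ą' never compares equal to 'ą' and the first count
-- comes out 0; decoding to Text first gives the intended answer.
main :: IO ()
main = do
  bs <- C.readFile "sample.txt"
  print (C.length (C.filter (== 'ą') bs))                  -- wrong for non-latin1 text
  print (T.length (T.filter (== 'ą') (TE.decodeUtf8 bs)))  -- correct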
Choose your datatype according to your problem; there is no one size that fits all.
My measurements seem to favour String, but they are probably wrong.

Regards