
On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
On Sunday 11 April 2010 18:04:14, Maciej Piechotka wrote:
Of course:
- I haven't done any tests. I guessed (which I wrote).
I have just done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside the latin1 range), and the same file with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
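For reference, the appended variant could be produced roughly like this (a sketch, not necessarily the code actually used; the output name "big-appended.txt" is made up):

import System.IO

-- Hypothetical helper: copy big.txt and append code points 256..10000
-- so the result also contains characters outside the latin1 range.
main :: IO ()
main = do
  hIn <- openFile "big.txt" ReadMode
  hSetEncoding hIn latin1            -- the source stays within latin1
  s <- hGetContents hIn
  hOut <- openFile "big-appended.txt" WriteMode
  hSetEncoding hOut utf8             -- the appended code points need UTF-8
  hPutStr hOut (s ++ '\n' : map toEnum [256 .. 10000] ++ "\n")
  hClose hOut
  hClose hIn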
A converted myspell Polish dictionary (a few % of non-ASCII chars), added twice (6531616 bytes); this is the "dict" file used by the benchmark below.
Code:
main = A.readFile "big.txt" >>= print . B.length
{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)

-- String I/O through an explicitly UTF-8 handle
readFile :: String -> IO String
readFile p = do
  h <- openFile p ReadMode
  hSetEncoding h utf8
  hGetContents h

-- Time one action, forcing its result; GC first so runs interfere less
measure :: IO a -> IO NominalDiffTime
measure a = do
  performGC
  start <- getCurrentTime
  !_ <- a
  end <- getCurrentTime
  return $! end `diffUTCTime` start

-- Count spaces (fromEnum works for both Word8 and Char)
find !x v | fromEnum v == 32 = x + 1
          | otherwise        = x

-- Count 'ą' and 'Ą' (Char-based containers only)
find' !x 'ą' = x + 1
find' !x 'Ą' = x + 1
find' !x _   = x

main =
     printMeasure "Length - ByteString"               (S.length  <$> S.readFile  "dict")
  >> printMeasure "Length - Lazy ByteString"          (L.length  <$> L.readFile  "dict")
  >> printMeasure "Length - String"                   (length    <$> readFile    "dict")
  >> printMeasure "Length - UTF8 ByteString"          (SU.length <$> S.readFile  "dict")
  >> printMeasure "Length - UTF8 Lazy ByteString"     (LU.length <$> L.readFile  "dict")
  >> printMeasure "Length - Text"                     (T.length  <$> T.readFile  "dict")
  >> printMeasure "Length - Lazy Text"                (TL.length <$> TL.readFile "dict")
  >> printMeasure "Searching - ByteString"            (S.foldl'  find 0 <$> S.readFile  "dict")
  >> printMeasure "Searching - Lazy ByteString"       (L.foldl'  find 0 <$> L.readFile  "dict")
  >> printMeasure "Searching - String"                (foldl'    find 0 <$> readFile    "dict")
  >> printMeasure "Searching - UTF8 ByteString"       (SU.foldl  find 0 <$> S.readFile  "dict")
  >> printMeasure "Searching - UTF8 Lazy ByteString"  (LU.foldl  find 0 <$> L.readFile  "dict")
  >> printMeasure "Searching - Text"                  (T.foldl'  find 0 <$> T.readFile  "dict")
  >> printMeasure "Searching - Lazy Text"             (TL.foldl' find 0 <$> TL.readFile "dict")
  >> printMeasure "Searching ą - String"              (foldl'    find' 0 <$> readFile    "dict")
  >> printMeasure "Searching ą - UTF8 ByteString"     (SU.foldl  find' 0 <$> S.readFile  "dict")
  >> printMeasure "Searching ą - UTF8 Lazy ByteString" (LU.foldl find' 0 <$> L.readFile  "dict")
  >> printMeasure "Searching ą - Text"                (T.foldl'  find' 0 <$> T.readFile  "dict")
  >> printMeasure "Searching ą - Lazy Text"           (TL.foldl' find' 0 <$> TL.readFile "dict")

printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v ->
  printf "%-40s %8.5f s\n" (s ++ ":") (realToFrac v :: Float)
where (A,B) in the one-line test above is a suitable combination of the following (one concrete pairing is sketched after the list):
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude
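For concreteness, here is one such (A,B) pairing spelled out, a sketch using the strict Text combination (any of the listed module pairs works the same way):

import qualified Data.Text as T        -- B: Data.Text.length counts Chars
import qualified Data.Text.IO as TIO   -- A: Data.Text.IO.readFile

main :: IO ()
main = TIO.readFile "big.txt" >>= print . T.length

-- Swapping both imports for Data.ByteString gives the byte-counting
-- variant of the same one-liner.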
Times (for the one-line test):
Data.ByteString[.Lazy]:     0.00s
Data.ByteString.UTF8:       0.14s
Prelude:                    0.21s
Data.ByteString.Lazy.UTF8:  0.56s
Data.Text:                  0.66s
Optimized:

Length - ByteString:                    0.01223 s
Length - Lazy ByteString:               0.00328 s
Length - String:                        0.15474 s
Length - UTF8 ByteString:               0.19945 s
Length - UTF8 Lazy ByteString:          0.30123 s
Length - Text:                          0.70438 s
Length - Lazy Text:                     0.62137 s

String seems to be the fastest of the 'correct' (Unicode-aware) ones.

Searching - ByteString:                 0.04604 s
Searching - Lazy ByteString:            0.04424 s
Searching - String:                     0.18178 s
Searching - UTF8 ByteString:            0.32606 s
Searching - UTF8 Lazy ByteString:       0.42984 s
Searching - Text:                       0.26599 s
Searching - Lazy Text:                  0.37320 s

While ByteString is the clear winner, String actually does well compared to the others.

Searching ą - String:                   0.18557 s
Searching ą - UTF8 ByteString:          0.32752 s
Searching ą - UTF8 Lazy ByteString:     0.43811 s
Searching ą - Text:                     0.28401 s
Searching ą - Lazy Text:                0.37612 s

String is the fastest? Hmmm.

Compiled:

Length - ByteString:                    0.00861 s
Length - Lazy ByteString:               0.00409 s
Length - String:                        0.16059 s
Length - UTF8 ByteString:               0.20165 s
Length - UTF8 Lazy ByteString:          0.31885 s
Length - Text:                          0.70891 s
Length - Lazy Text:                     0.65553 s

ByteString is again the clear winner, but String once again wins in the 'correct' section.

Searching - ByteString:                 1.27414 s
Searching - Lazy ByteString:            1.27303 s
Searching - String:                     0.56831 s
Searching - UTF8 ByteString:            0.68742 s
Searching - UTF8 Lazy ByteString:       0.75883 s
Searching - Text:                       1.16121 s
Searching - Lazy Text:                  1.76678 s

I mean... what? I may be doing something wrong.

Searching ą - String:                   0.32612 s
Searching ą - UTF8 ByteString:          0.41564 s
Searching ą - UTF8 Lazy ByteString:     0.52919 s
Searching ą - Text:                     0.87463 s
Searching ą - Lazy Text:                1.52369 s

No comment.

Interpreted:

Length - ByteString:                    0.00511 s
Length - Lazy ByteString:               0.00378 s
Length - String:                        0.16657 s
Length - UTF8 ByteString:               0.21639 s
Length - UTF8 Lazy ByteString:          0.33952 s
Length - Text:                          0.79771 s
Length - Lazy Text:                     0.65320 s

As with the others.

Searching - ByteString:                 9.12051 s
Searching - Lazy ByteString:            8.94038 s
Searching - String:                     8.57391 s
Searching - UTF8 ByteString:            7.71766 s
Searching - UTF8 Lazy ByteString:       7.79422 s
Searching - Text:                       8.34435 s
Searching - Lazy Text:                  9.07538 s

Now they are pretty much equal.

Searching ą - String:                   3.17010 s
Searching ą - UTF8 ByteString:          3.94399 s
Searching ą - UTF8 Lazy ByteString:     3.92382 s
Searching ą - Text:                     3.32901 s
Searching ą - Lazy Text:                4.18038 s

Hmm. Still the best?

Your test (the one-liner above; times in seconds):

                        Optimized  Compiled  Interpreted
ByteString:                 0.011     0.011        0.421
ByteString Lazy:            0.006     0.006        0.535
String:                     0.237     0.240        0.650
Text:                       0.767     0.720        1.192
Text Lazy:                  0.661     0.614        1.061
ByteString UTF8:            0.204     0.204        0.631
ByteString Lazy UTF8:       0.386     0.309        0.744

System:
Core 2 Duo T9600 2.80 GHz, 2 GiB RAM
Gentoo Linux x86-64, Linux 2.6.33 + gentoo patches + ck, glibc 2.11
GHC 6.12.1, base 4.2.0.0, bytestring 0.9.1.5, text 0.7.1.0, utf8-string 0.3.6

PS. Tests were repeated a few times and each run gave similar results.
- It wasn't stated what the typical case is.
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. FASTA files, numerical data). For those, ByteString is usually by far the best choice.
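A minimal sketch of that kind of job, assuming a whitespace-separated file of ASCII integers (the name "numbers.txt" is made up):

import qualified Data.ByteString.Char8 as C
import Data.Maybe (mapMaybe)

-- Sum a whitespace-separated file of ASCII integers. Char8 is fine here
-- because the input is known to be ASCII/latin1, and readInt never goes
-- through String.
main :: IO ()
main = do
  s <- C.readFile "numbers.txt"
  print (sum (map fst (mapMaybe C.readInt (C.words s))))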
On the other hand, if you load numerical data it is likely that:
- it will have some labels, and those labels may well need non-ASCII or non-Latin characters;
- most of the time will be spent operating on the numbers rather than on the strings (see the sketch below).
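As a rough illustration of that split, the numeric bulk can stay in ByteString while only the labels are decoded (a sketch; the line format and the name "data.txt" are assumptions):

import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Maybe (mapMaybe)

-- Assumed line format: a label (possibly non-ASCII, UTF-8 encoded)
-- followed by whitespace-separated integers. Only the label is decoded
-- to Text; the numbers never leave ByteString.
parseLine :: C.ByteString -> (T.Text, [Int])
parseLine l = case C.words l of
  (lab:nums) -> (TE.decodeUtf8 lab, map fst (mapMaybe C.readInt nums))
  []         -> (T.empty, [])

main :: IO ()
main = do
  rows <- fmap (map parseLine . C.lines) (C.readFile "data.txt")
  print (sum (concatMap snd rows))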
Another fairly typical case is *text* processing, possibly with text in different scripts (Latin, Hebrew, Kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice; vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to UTF-8.
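A minimal illustration of why vanilla ByteStrings fall down on such text, assuming a UTF-8 encoded input (the name "sample.txt" is made up):

import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8 treats every byte as a separate latin1 Char, so the two-byte
-- UTF-8 sequence for 'ą' never compares equal to 'ą' and the first count
-- comes out 0; decoding to Text first gives the intended answer.
main :: IO ()
main = do
  bs <- C.readFile "sample.txt"
  print (C.length (C.filter (== 'ą') bs))                  -- wrong for non-latin1 text
  print (T.length (T.filter (== 'ą') (TE.decodeUtf8 bs)))  -- correct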
Choose your datatype according to your problem; there is no one size that fits all.
My measurements seem to favour String, but they are probably wrong.

Regards