When to use ByteString rather than [Char] ... ?

Hi,

After working through a few Haskell tutorials, I've come across numerous recommendations to use the Data.ByteString library rather than standard [Char], for reasons of "performance". I'm having trouble swallowing this -- presumably the standard String is the default for good reasons. Nothing has answered this question: in what case is it better to use [Char]? Could anyone point me to a good resource showing the differences between how [Char] and ByteString are implemented, and giving a good heuristic for me to decide which is better in any one case?

Best,
James Fisher

Hi, James.
Because String is represented as [Char], each element of a String is
allocated individually. ByteString represents a string as a single contiguous
array and has no per-character allocation overhead.
I recommend reading "Real World Haskell" -
http://book.realworldhaskell.org/read/. There are some notes about
String/ByteString in Chapter 8 -
http://book.realworldhaskell.org/read/efficient-file-processing-regular-expr...
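The representation difference described above can be sketched with a tiny illustrative program (file names and values are made up for the example):

```haskell
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  let s  = "haskell" :: String   -- [Char]: one cons cell plus one boxed Char per element
      bs = B.pack s              -- ByteString: one contiguous byte array
  print (length s)               -- 7
  print (B.length bs)            -- 7
  putStrLn (B.unpack bs)         -- back to [Char]: "haskell"
```

Note that Data.ByteString.Char8 truncates characters to 8 bits, so this round trip is only faithful for Latin-1 text.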
On Sun, Apr 11, 2010 at 3:07 PM, James Fisher wrote: [SNIP]

On Sun, 2010-04-11 at 16:11 +0400, Mukhamed Karanashev wrote:
Hi, James. Because String is represented as [Char], each element of a String is allocated individually. ByteString represents a string as a single contiguous array and has no per-character allocation overhead.
Hmm. I read somewhere that GHC optimises String to handle it more like a C char * where possible. Could anyone clarify this? Regards

Hi James

There's a paper describing the implementation of ByteStrings here:
http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
http://www.cse.unsw.edu.au/~dons/papers/fusion.pdf

For my own work, I generally need short immutable strings and haven't found ByteStrings compelling, though the results presented in the above suggest [Char] is better at nothing and worse at many things. [Hmm - insert emoticon here]

Best wishes
Stephen

Am Sonntag 11 April 2010 15:29:38 schrieb Stephen Tetley:
Hi James
There's a paper describing the implementation of ByteStrings here:
http://www.cse.unsw.edu.au/~dons/papers/CSL06.html http://www.cse.unsw.edu.au/~dons/papers/fusion.pdf
For my own work, I generally need short immutable strings and haven't found ByteStrings compelling,
ByteStrings shine for long strings. When you're using long strings, ByteStrings almost certainly are *much* faster (utf8-ByteStrings are probably significantly slower, but should still beat [Char] comfortably). I've found ByteStrings better than [Char] when dealing with short strings only for a few things (e.g. as keys of Maps, ByteStrings tend to be better [at least if using ByteStrings there doesn't introduce too much packing and unpacking], things like edit-distance are faster on ByteStrings; UArray Int Char is slower than ByteString [in my measurements] for these tasks, but it can also be used for characters > toEnum 255 and isn't too much slower). Other things [see below] were faster for short [Char] than for short ByteStrings. When dealing with short strings, in my experience there are rarely compelling reasons to choose one over the other.
though the results presented in the above suggest [Char] is better at nothing
[Char] is (far) better at sorting short Strings; it often is better for map and filter.
and worse at many things.
[Char]-IO is abysmally slow in comparison, [Char] uses much more memory, random access is horrible for lists.
[Hmm - insert emoticon here]
Best wishes
Stephen
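Daniel's point about ByteStrings as Map keys can be sketched as follows. This is an illustrative comparison only; the key names and sizes are made up, and the point is merely that comparing compact byte arrays is typically cheaper than comparing boxed linked lists of Char:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

-- Hypothetical key set for the demonstration.
keys :: [String]
keys = [ "key" ++ show i | i <- [1 .. 1000 :: Int] ]

-- The same map keyed by [Char] and by ByteString.
stringMap :: M.Map String Int
stringMap = M.fromList (zip keys [1 ..])

byteMap :: M.Map B.ByteString Int
byteMap = M.fromList (zip (map B.pack keys) [1 ..])

main :: IO ()
main = do
  print (M.lookup "key500" stringMap)          -- Just 500
  print (M.lookup (B.pack "key500") byteMap)   -- Just 500
```

As Daniel notes, this only pays off if the keys live as ByteStrings anyway; packing and unpacking at every lookup would eat the gain.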

On 11 April 2010 15:42, Daniel Fischer
When dealing with short strings, in my experience there are rarely compelling reasons to choose one over the other.
Hi Daniel

Thanks - I was slightly surprised at the results in the paper because the 'cons' test was equal; I thought bytestrings would have to do a bit more work for a 'cons' - looking at the code, lazy bytestring uses one constructor and a bit of C memory poking, which is more than I'd expect the [Char] version to do.

The only 'determinant' I've found for choosing which type for short strings is whether I'm using a library that forces one or the other on me; otherwise I'm swayed by the simplicity of [Char].

Best wishes
Stephen

Am Sonntag 11 April 2010 17:15:22 schrieb Stephen Tetley:
On 11 April 2010 15:42, Daniel Fischer
wrote: [SNIP] When dealing with short strings, in my experience there are rarely compelling reasons to choose one over the other.
Hi Daniel
Thanks - I was slightly surprised at the results in the paper because the 'cons' test was equal; I thought bytestrings would have to do a bit more work for a 'cons' - looking at the code, lazy bytestring uses one constructor and a bit of C memory poking, which is more than I'd expect the [Char] version to do.
Well, I guess it depends on what actually happens with fusion (a single cons doesn't take significant time for either). If repeated conses lead to a chain of one-element chunks, I'd expect that to be significantly slower than [Char], but if it's rewritten to
- allocate a new chunk,
- write from the end and decrement the offset counter,
it shouldn't be slower.
The only 'determinant' I've found for choosing which type for short strings is if I'm using a library that forces one or the other on me,
Sure, that's pretty compelling - as long as you don't need two libraries with different choices :)
otherwise I'm swayed by the simplicity of [Char].
Best wishes
Stephen
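The cons behaviour discussed above can be sketched with the lazy ByteString API. This is illustrative only; the strings are made up, and the performance claims are the ones from the discussion, not measurements:

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
  -- List cons: one new cell pointing at the old list.
  let s = 'a' : "bc"
  -- Lazy ByteString cons: prepends a new one-byte chunk, so repeated
  -- L.cons can build the chain of tiny chunks worried about above.
  let bs = L.cons 'a' (L.pack "bc")
  -- L.cons' is the strict variant: it may coalesce the new byte onto the
  -- first existing chunk instead of starting a new one.
  let bs' = L.cons' 'a' (L.pack "bc")
  putStrLn s
  L.putStrLn bs
  L.putStrLn bs'
```

All three produce "abc"; the difference is purely in the in-memory chunk structure.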

On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
Hi,
After working through a few Haskell tutorials, I've come across numerous recommendations to use the Data.ByteString library rather than standard [Char], for reasons of "performance". I'm having trouble swallowing this -- presumably the standard String is default for good reasons. Nothing has answered this question: in what case is it better to use [Char]?
In most cases you need an actual String and it is not time-critical, I believe. ByteString is... well, a string of bytes, not chars - you have no idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1 (or as jpeg ;) ). If you want the next char you don't know how many bytes you need to read (1? 2? 3? depends on contents?). String ([Char]) has a defined representation - while read/write functions might encode/decode it incorrectly (up to GHC 6.12, System.IO assumed ascii encoding on read, IIRC), that is their error.
Could anyone point me to a good resource showing the differences between how [Char] and ByteString are implemented, and giving a good heuristic for me to decide which is better in any one case?
ByteString is a pointer with an offset and length. Lazy ByteString is a linked list of ByteStrings (with the additional condition that none of the inner ByteStrings are empty). In theory String is [Char], i.e. [a], i.e.

data [a] = [] | a:[a]

In other words it is a linked list of characters. That, for long strings, may be inefficient (because of cache behaviour, O(n) random access and the necessity of checking for errors while evaluating further[1]). I heard somewhere that actual implementations optimize it to arrays when possible (i.e. when it can be detected and does not mess with non-strict semantics). However I don't know if it is true. I *guess* that in most cases the overhead on I/O will be sufficiently great to make the difference insignificant. However:

- If you need an exact byte representation - for example for compression, digital signatures etc. - you need ByteString
- If you need to operate on text rather than bytes, use String or specialised data structures such as Data.Text & co.
- If you don't care about performance and need ease of use (pattern matching etc.), use String.
- If you have no special requirements, then you can use ByteString.

While some languages (for example C, Python, Ruby) mix text and its representation, I guess it is not always the best way. In such a separation, String is text while ByteString is a binary representation of something (which can be text, a picture, compressed data etc.).
Best,
James Fisher
Regards

[1] However, the O(n) access time and error checking are still introduced by decoding the string. So if you need UTF-8 you will still get O(n) access time ;)
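The bytes-versus-chars distinction made above can be demonstrated concretely. This sketch assumes the utf8-string package (Data.ByteString.UTF8) is available; the sample word is arbitrary:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U

main :: IO ()
main = do
  let bs = U.fromString "żółw"   -- 4 characters, UTF-8 encoded
  print (B.length bs)            -- 7: three of the characters take two bytes each
  print (U.length bs)            -- 4: counted by decoding the bytes
```

This is exactly why "the next char" in a raw ByteString may be 1, 2 or 3 bytes away: B.length and U.length disagree whenever multi-byte characters are present.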

Am Sonntag 11 April 2010 15:31:52 schrieb Maciej Piechotka:
On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
Hi,
After working through a few Haskell tutorials, I've come across numerous recommendations to use the Data.ByteString library rather than standard [Char], for reasons of "performance". I'm having trouble swallowing this -- presumably the standard String is default for good reasons.
But performance is none of those reasons. The choice to make String a synonym for [Char] instead of a specialised datatype allows you to easily manipulate strings with the plethora of list-processing functions from the standard libraries. However, that means some things can't be fast, and there's a significant space overhead for string handling.
Nothing has answered this question: in what case is it better to use [Char]?
In most cases you need an actual String and it is not time-critical, I believe. ByteString is... well, a string of bytes, not chars - you have no idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1 (or as jpeg ;) ). If you want the next char you don't know how many bytes you need to read (1? 2? 3? depends on contents?).
And - sorting short strings - for using map or filter, [Char] is often superior
String ([Char]) has a defined representation - while read/write functions might encode/decode it incorrectly (up to GHC 6.12, System.IO assumed ascii encoding on read, IIRC), that is their error.
Could anyone point me to a good resource showing the differences between how [Char] and ByteString are implemented, and giving a good heuristic for me to decide which is better in any one case?
ByteString is a pointer with an offset and length. Lazy ByteString is a linked list of ByteStrings (with the additional condition that none of the inner ByteStrings are empty).
And it's a head-strict list, not the usual lazy Haskell list.
In theory String is [Char] i.e. [a] i.e.
data [a] = [] | a:[a]
In other words it is a linked list of characters. That, for long strings, may be inefficient (because of cache behaviour, O(n) random access and the necessity of checking for errors while evaluating further[1]).
I heard somewhere that actual implementations optimize it to arrays when possible (i.e. when it can be detected and does not mess with non-strict semantics). However I don't know if it is true.
I've never heard of that before, so I'm skeptical.
I *guess* that in most cases the overhead on I/O will be sufficiently great to make the difference insignificant. However:
? which difference? Try reading large files. Count the lines or something else, as long as it's simple. The speed difference between ByteString-IO and [Char]-IO is enormous. When you do something more complicated the difference in IO-speed may become insignificant. On the other hand, when you're appending a lot of short lines to a file one by one, there's a good chance that [Char]-IO is actually faster.
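The line-counting experiment suggested above can be sketched like this. The file name and contents are made up so the example is self-contained; in practice you would point it at something like a dictionary file:

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
  -- A throwaway input file for the demonstration.
  writeFile "demo.txt" (unlines (replicate 1000 "hello world"))
  -- [Char]-IO: every byte is decoded into a boxed Char as it is read.
  n1 <- length . lines <$> readFile "demo.txt"
  -- ByteString-IO: scans a raw byte buffer for '\n'.
  n2 <- L.count '\n' <$> L.readFile "demo.txt"
  print n1   -- 1000
  print n2   -- 1000
```

Both agree on the count here; the point of the thread is that for large files the ByteString version spends far less time getting to that answer.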
- If you need an exact byte representation - for example for compression, digital signatures etc. - you need ByteString
- If you need to operate on text rather than bytes, use String or specialised data structures such as Data.Text & co.
- If you don't care about performance and need ease of use (pattern matching etc.), use String.
- If you have no special requirements, then you can use ByteString.
While some languages (for example C, Python, Ruby) mix text and its representation, I guess it is not always the best way. In such a separation, String is text while ByteString is a binary representation of something (which can be text, a picture, compressed data etc.).
Best,
James Fisher
Regards
[1] However, the O(n) access time and error checking are still introduced by decoding the string. So if you need UTF-8 you will still get O(n) access time ;)
It might then be a good idea to use a UArray Int Char if you need repeated random access.
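Daniel's UArray Int Char suggestion can be sketched as follows (the sample string and helper name are made up for the illustration):

```haskell
import Data.Array.Unboxed

-- UArray Int Char stores the characters unboxed in one block, gives O(1)
-- indexing, and (unlike a plain ByteString) handles code points above 255.
toCharArray :: String -> UArray Int Char
toCharArray s = listArray (0, length s - 1) s

main :: IO ()
main = do
  let arr = toCharArray "żółw etc."
  putStrLn [arr ! 5]     -- O(1) access to the sixth character
  print (bounds arr)
```

Building the array is O(n), so this only pays off when the same string is indexed repeatedly.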

On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
I *guess* that in most cases the overhead on I/O will be sufficiently great to make the difference insignificant. However:
? which difference?
Try reading large files.
Well - while large files are not unimportant, IIRC most files are small (< 4 KiB) - at least on *nix file systems (at least that's the core 'idea' of the reiserfs/reiser4 filesystems). I guess that for large strings something like text (I think I mentioned it) is better.
Count the lines or something else, as long as it's simple. The speed difference between ByteString-IO and [Char]-IO is enormous. When you do something more complicated the difference in IO-speed may become insignificant.
Hmm. As newline is a single-byte character in most encodings, that is believable. However, what is the difference in counting chars (not bytes - chars)? I wouldn't be surprised if the difference was smaller. Of course:
- I haven't done any tests. I guessed (which I wrote)
- It wasn't written what the typical case is
- What is a 'significant' difference?
Regards

Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
I *guess* that in most cases the overhead on I/O will be sufficiently great to make the difference insignificant. However:
? which difference?
I meant: the difference between ByteString-IO and [Char]-IO, or which difference?
Try reading large files.
Well - while large files are not unimportant, IIRC most files are small (< 4 KiB) - at least on *nix file systems (at least that's the core 'idea' of the reiserfs/reiser4 filesystems).
Well, sometimes one has to process large files even though most are small. If the processing itself is simple, IO-speed is important then.
I guess that for large strings something like text (I think I mentioned it) is better
Unless you know you only have to deal with one-byte characters, when plain ByteStrings are the simplest and fastest method. But those are special cases, in general I agree.
Count the lines or something else, as long as it's simple. The speed difference between ByteString-IO and [Char]-IO is enormous. When you do something more complicated the difference in IO-speed may become insignificant.
Hmm. As newline is a single-byte character in most encodings, that is believable.
You can measure it yourself :) cat-ing together a few copies of /usr/share/dict/words should give a large enough file.
However, what is the difference in counting chars (not bytes - chars)? I wouldn't be surprised if the difference was smaller.
Nor would I. In fact I'd be surprised if it wasn't smaller. [see below] This example was meant to illustrate the difference in IO-speed, so extremely simple processing was appropriate. The combination of doing IO and processing is something different. If you're doing complicated things, IO time has a good chance of becoming negligible.
Of course: - I haven't done any tests. I guessed (which I wrote)
I just have done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside latin1 range) and the same with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.

Code:

main = A.readFile "big.txt" >>= print . B.length

where (A,B) is a suitable combination of
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude

Times:
Data.ByteString[.Lazy]:    0.00s
Data.ByteString.UTF8:      0.14s
Prelude:                   0.21s
Data.ByteString.Lazy.UTF8: 0.56s
Data.Text:                 0.66s

Of course Data.ByteString didn't count characters but bytes, so for the modified file, those printed larger numbers than the others (well, it's BYTEString, isn't it?). It's a little unfair, though, as the ByteString[.Lazy] variants don't need to look at each individual byte, so I also let them and Prelude.String count newlines to see how fast they can inspect each character/byte:

BS[.Lazy]: 0.02s
Prelude:   0.23s

so both take about 0.02s to inspect the items.

To summarise:
* ByteString-IO is blazingly fast, since all it has to do is get a sequence of bytes from disk into memory.
* [Char]-IO is much slower because it has to transform the sequence of bytes to individual characters as they come.
* Counting utf-8 encoded characters in a ByteString is - unsurprisingly - slow. I'm a bit surprised *how* slow it is for lazy ByteStrings. (Caveat: I've no idea whether Data.ByteString.UTF8 would suffer from more multi-byte characters to the point where String becomes faster. My guess is no, not for a single traversal. For multiple traversals, String has to identify each individual character only once, while BS.UTF8 must do it each time, so then String may be faster.)
* Data.Text isn't very fast for that one.
- It wasn't written what the typical case is
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. fasta files, numerical data). For those, usually ByteString is by far the best choice. Another fairly typical case is *text* processing, possibly with text in different scripts (latin, hebrew, kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to utf-8. Choose your datatype according to your problem, not one size fits all.
- What is a 'significant' difference?
Depends of course. For a task performed once, who cares whether it takes one second or three? One hour or three, however, is a significant difference (assuming approximately equal times to write the code). Sometimes 10% difference in performance is important, sometimes a factor of 10 isn't. The point is that you should be aware of the performance differences when making your choice.
Regards
Cheers, Daniel

On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
Of course: - I haven't done any tests. I guessed (which I written)
I just have done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside latin1 range) and the same with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
I converted a myspell Polish dictionary (a few % of non-ascii chars), appended twice (6531616 bytes).
Code:
main = A.readFile "big.txt" >>= print . B.length
{-# LANGUAGE BangPatterns #-}

import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)

readFile :: String -> IO String
readFile p = do
    h <- openFile p ReadMode
    hSetEncoding h utf8
    hGetContents h

measure :: IO a -> IO NominalDiffTime
measure a = do
    performGC
    start <- getCurrentTime
    !_ <- a
    end <- getCurrentTime
    return $! end `diffUTCTime` start

find !x v | fromEnum v == 32 = x + 1
          | otherwise        = x

find' !x 'ą' = x + 1
find' !x 'Ą' = x + 1
find' !x _   = x

main =
    printMeasure "Length - ByteString" (S.length <$> S.readFile "dict") >>
    printMeasure "Length - Lazy ByteString" (L.length <$> L.readFile "dict") >>
    printMeasure "Length - String" (length <$> readFile "dict") >>
    printMeasure "Length - UTF8 ByteString" (SU.length <$> S.readFile "dict") >>
    printMeasure "Length - UTF8 Lazy ByteString" (LU.length <$> L.readFile "dict") >>
    printMeasure "Length - Text" (T.length <$> T.readFile "dict") >>
    printMeasure "Length - Lazy Text" (TL.length <$> TL.readFile "dict") >>
    printMeasure "Searching - ByteString" (S.foldl' find 0 <$> S.readFile "dict") >>
    printMeasure "Searching - ByteString" (L.foldl' find 0 <$> L.readFile "dict") >>
    printMeasure "Searching - String" (foldl' find 0 <$> readFile "dict") >>
    printMeasure "Searching - UTF8 ByteString" (SU.foldl find 0 <$> S.readFile "dict") >>
    printMeasure "Searching - UTF8 Lazy ByteString" (LU.foldl find 0 <$> L.readFile "dict") >>
    printMeasure "Searching - Text" (T.foldl' find 0 <$> T.readFile "dict") >>
    printMeasure "Searching - Lazy Text" (TL.foldl' find 0 <$> TL.readFile "dict") >>
    printMeasure "Searching ą - String" (foldl' find' 0 <$> readFile "dict") >>
    printMeasure "Searching ą - UTF8 ByteString" (SU.foldl find' 0 <$> S.readFile "dict") >>
    printMeasure "Searching ą - UTF8 Lazy ByteString" (LU.foldl find' 0 <$> L.readFile "dict") >>
    printMeasure "Searching ą - Text" (T.foldl' find' 0 <$> T.readFile "dict") >>
    printMeasure "Searching ą - Lazy Text" (TL.foldl' find' 0 <$> TL.readFile "dict")

printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v ->
    printf "%-40s %8.5f s\n" (s ++ ":") (realToFrac v :: Float)
where (A,B) is a suitable combination of
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude

Times:
Data.ByteString[.Lazy]:    0.00s
Data.ByteString.UTF8:      0.14s
Prelude:                   0.21s
Data.ByteString.Lazy.UTF8: 0.56s
Data.Text:                 0.66s
Optimized:

Length - ByteString:           0.01223 s
Length - Lazy ByteString:      0.00328 s
Length - String:               0.15474 s
Length - UTF8 ByteString:      0.19945 s
Length - UTF8 Lazy ByteString: 0.30123 s
Length - Text:                 0.70438 s
Length - Lazy Text:            0.62137 s

String seems to be the fastest correct one.

Searching - ByteString:           0.04604 s
Searching - ByteString:           0.04424 s
Searching - String:               0.18178 s
Searching - UTF8 ByteString:      0.32606 s
Searching - UTF8 Lazy ByteString: 0.42984 s
Searching - Text:                 0.26599 s
Searching - Lazy Text:            0.37320 s

While ByteString is the clear winner, String is actually good compared to the others.

Searching ą - String:               0.18557 s
Searching ą - UTF8 ByteString:      0.32752 s
Searching ą - UTF8 Lazy ByteString: 0.43811 s
Searching ą - Text:                 0.28401 s
Searching ą - Lazy Text:            0.37612 s

String is fastest? Hmmm.

Compiled:

Length - ByteString:           0.00861 s
Length - Lazy ByteString:      0.00409 s
Length - String:               0.16059 s
Length - UTF8 ByteString:      0.20165 s
Length - UTF8 Lazy ByteString: 0.31885 s
Length - Text:                 0.70891 s
Length - Lazy Text:            0.65553 s

ByteString is also the clear winner, but String once again wins in the 'correct' section.

Searching - ByteString:           1.27414 s
Searching - ByteString:           1.27303 s
Searching - String:               0.56831 s
Searching - UTF8 ByteString:      0.68742 s
Searching - UTF8 Lazy ByteString: 0.75883 s
Searching - Text:                 1.16121 s
Searching - Lazy Text:            1.76678 s

I mean... what? I may be doing something wrong.

Searching ą - String:               0.32612 s
Searching ą - UTF8 ByteString:      0.41564 s
Searching ą - UTF8 Lazy ByteString: 0.52919 s
Searching ą - Text:                 0.87463 s
Searching ą - Lazy Text:            1.52369 s

No comment.

Interpreted:

Length - ByteString:           0.00511 s
Length - Lazy ByteString:      0.00378 s
Length - String:               0.16657 s
Length - UTF8 ByteString:      0.21639 s
Length - UTF8 Lazy ByteString: 0.33952 s
Length - Text:                 0.79771 s
Length - Lazy Text:            0.65320 s

As with the others.

Searching - ByteString:           9.12051 s
Searching - ByteString:           8.94038 s
Searching - String:               8.57391 s
Searching - UTF8 ByteString:      7.71766 s
Searching - UTF8 Lazy ByteString: 7.79422 s
Searching - Text:                 8.34435 s
Searching - Lazy Text:            9.07538 s

Now they are pretty much equal.

Searching ą - String:               3.17010 s
Searching ą - UTF8 ByteString:      3.94399 s
Searching ą - UTF8 Lazy ByteString: 3.92382 s
Searching ą - Text:                 3.32901 s
Searching ą - Lazy Text:            4.18038 s

Hmm. Still the best?

Your test:

                       Optimized  Compiled  Interpreted
ByteString:            0.011      0.011     0.421
ByteString Lazy:       0.006      0.006     0.535
String:                0.237      0.240     0.650
Text:                  0.767      0.720     1.192
Text Lazy:             0.661      0.614     1.061
ByteString UTF8:       0.204      0.204     0.631
ByteString Lazy UTF8:  0.386      0.309     0.744

System: Core 2 Duo T9600 2.80 GHz, 2 GiB RAM. Gentoo Linux x86-64. Linux 2.6.33 + gentoo patches + ck. Glibc 2.11. GHC 6.12.1, base 4.2.0.0, bytestring 0.9.1.5, text 0.7.1.0, utf8-string 0.3.6.

PS. Tests were repeated a few times and each gave similar results.
- It wasn't written what the typical case is
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. fasta files, numerical data). For those, usually ByteString is by far the best choice.
On the other hand - if you load numerical data it is likely that:
- It will have some labels. The labels can happen to need non-ascii or non-latin elements
- Most of the time will be spent operating on numbers rather than strings.
Another fairly typical case is *text* processing, possibly with text in different scripts (latin, hebrew, kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to utf-8.
Choose your datatype according to your problem, not one size fits all.
My measurements seem to prefer String but they are probably wrong. Regards

Am Montag 12 April 2010 01:01:36 schrieb Maciej Piechotka:
On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
Of course: - I haven't done any tests. I guessed (which I wrote)
I just have done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside latin1 range) and the same with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
Converted myspell polish dictonary (a few % of non-ascii chars) added twice (6531616 bytes).
Optimized:
Length - ByteString:           0.01223 s
Length - Lazy ByteString:      0.00328 s
Length - String:               0.15474 s
Length - UTF8 ByteString:      0.19945 s
Length - UTF8 Lazy ByteString: 0.30123 s
Length - Text:                 0.70438 s
Length - Lazy Text:            0.62137 s
String seems to be the fastest correct one.
For me, strict UTF8 ByteString was faster than String, but as for you, both played in the same league.
Searching - ByteString:           0.04604 s
Searching - ByteString:           0.04424 s
Searching - String:               0.18178 s
Searching - UTF8 ByteString:      0.32606 s
Searching - UTF8 Lazy ByteString: 0.42984 s
Searching - Text:                 0.26599 s
Searching - Lazy Text:            0.37320 s
While ByteString is the clear winner, String is actually good compared to the others.
The ByteStrings should be faster if you use count instead of foldl' find 0. Anyway, where applicable, ByteStrings are much faster for certain tasks - I suspect they wouldn't do so well if one had a (map function . filter predicate) on the input. I'm surprised here that
a) searching takes so much longer than calculating the length for BS.UTF8
b) searching is *much faster* than calculating the length for Text
c) both UTF8 BS and Text are slower than String
Searching ą - String:               0.18557 s
Searching ą - UTF8 ByteString:      0.32752 s
Searching ą - UTF8 Lazy ByteString: 0.43811 s
Searching ą - Text:                 0.28401 s
Searching ą - Lazy Text:            0.37612 s
String is fastest? Hmmm.
Not much difference to the previous.
Compiled:
Compiled means no optimisations? That's a *bad* idea for ByteStrings and Text. You only get the fusion magic with optimisations turned on, without, well...
Length - ByteString:           0.00861 s
Length - Lazy ByteString:      0.00409 s
Length - String:               0.16059 s
Length - UTF8 ByteString:      0.20165 s
Length - UTF8 Lazy ByteString: 0.31885 s
Length - Text:                 0.70891 s
Length - Lazy Text:            0.65553 s
ByteString is also the clear winner, but String once again wins in the 'correct' section.
Searching - ByteString:           1.27414 s
Searching - ByteString:           1.27303 s
Searching - String:               0.56831 s
Searching - UTF8 ByteString:      0.68742 s
Searching - UTF8 Lazy ByteString: 0.75883 s
Searching - Text:                 1.16121 s
Searching - Lazy Text:            1.76678 s
I mean... what? I may be doing something wrong
Yes, using ByteString and Text without optimisations :)
PS. Tests were repeated a few times and each gave similar results.
- It wasn't written what the typical case is
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. fasta files, numerical data). For those, usually ByteString is by far the best choice.
On the other hand - if you load numerical data it is likely that:
- It will have some labels. The labels can happen to need non-ascii or non-latin elements
Possible. But label-free formats of n columns of numbers are fairly common (and easier to handle).
- Most of the time will be spent operating on numbers rather than strings.
In that case, it is of course less important which type you use for IO.
Another fairly typical case is *text* processing, possibly with text in different scripts (latin, hebrew, kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to utf-8.
Choose your datatype according to your problem, not one size fits all.
My measurements seem to prefer String but they are probably wrong.
Yes, measurements always are wrong ;) More seriously, you measured a couple of tasks on one system. For different tasks and other systems, different results should be expected. The best choice depends on the task. For SPOJ problems, ByteStrings are what you'll want. For text processing, probably not. If you want to find a pattern in a long ASCII string, however, they likely are.
Regards
Cheers

On Sun, Apr 11, 2010 at 12:07:34PM +0100, James Fisher wrote:
use the Data.ByteString library rather than standard [Char]
These are different.

IN THE PAST:
- We used String for everything, both strings and binary data.

IN RECENT PAST:
- We used String for treating... strings of characters.
- We used ByteString for binary data.
- To read a UTF-8 string we used a package like utf8-string:
  1) Read file as ByteString.
  2) Convert UTF-8 bytes into String, a list of Chars.

TODAY:
- ByteString is used for binary data.
- String is used for text when performance isn't critical.
- Data.Text (from package 'text') is used for text when time and/or space efficiency is needed. Data.Text uses the same 'tricks' as ByteString, but while the latter encodes bytes, the former encodes Chars.

HTH,

-- Felipe.
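The "read bytes, then decode" workflow from the RECENT PAST section can be sketched as follows, assuming the utf8-string package; the file name and contents are made up so the example is self-contained:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U

main :: IO ()
main = do
  -- Prepare a UTF-8 encoded file (binary write of raw bytes).
  B.writeFile "note.txt" (U.fromString "naïve café\n")
  -- 1) Read the file as a ByteString: raw bytes, no interpretation.
  bytes <- B.readFile "note.txt"
  -- 2) Decode the UTF-8 bytes into a String, a list of Chars.
  let text = U.toString bytes
  putStr text
  print (length text)   -- 11 characters, though the file holds more bytes
```

Today the text package plays step 2's role with Data.Text.Encoding.decodeUtf8, but the two-phase shape - bytes in, characters out - is the same.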
participants (6):
- Daniel Fischer
- Felipe Lessa
- James Fisher
- Maciej Piechotka
- Mukhamed Karanashev
- Stephen Tetley