
On Sunday, 11 April 2010, 18:04:14, Maciej Piechotka wrote:
On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
I *guess* that in most cases the overhead on I/O will be sufficiently great to make the difference insignificant. However:
? which difference?
I meant: difference between ByteString-IO and [Char]-IO or which difference?
Try reading large files.
Well, while large files are not unimportant, IIRC most files are small (< 4 KiB), at least on *nix file systems (at least that's the core 'idea' of the reiserfs/reiser4 filesystems).
Well, sometimes one has to process large files even though most are small. If the processing itself is simple, IO-speed is important then.
I guess that for large strings something like text (I think I mentioned it) is better.
Unless you know you only have to deal with one-byte characters, when plain ByteStrings are the simplest and fastest method. But those are special cases, in general I agree.
Count the lines or something else, as long as it's simple. The speed difference between ByteString-IO and [Char]-IO is enormous. When you do something more complicated the difference in IO-speed may become insignificant.
Hmm. As newline is a single-byte character in most encodings it is believable.
You can measure it yourself :) cat-ing together a few copies of /usr/share/dict/words should give a large enough file.
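A minimal sketch of such a measurement, assuming the bytestring package is available. The function names countLinesBS and countLinesString are made up for illustration; timing them (e.g. with the shell's time command) is left to the reader. Note that Prelude's readFile decodes according to the locale encoding, so it can fail on bytes that are invalid in that encoding.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper: count newline bytes using strict ByteString IO.
countLinesBS :: FilePath -> IO Int
countLinesBS path = BC.count '\n' <$> BC.readFile path

-- Hypothetical helper: count newline characters using Prelude String IO.
countLinesString :: FilePath -> IO Int
countLinesString path = length . filter (== '\n') <$> readFile path
```

Something like main = countLinesBS "big.txt" >>= print, compiled once for each variant, should be enough to compare the two.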
However what is the difference in counting chars (not bytes - chars)? I wouldn't be surprised if the difference was smaller.
Nor would I. In fact I'd be surprised if it wasn't smaller. [see below] This example was meant to illustrate the difference in IO-speed, so extremely simple processing was appropriate. The combination of doing IO and processing is something different. If you're doing complicated things, IO time has a good chance to become negligible.
Of course:
- I haven't done any tests. I guessed (as I wrote)
I just have done a test. Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no characters outside latin1 range) and the same with ('\n':map toEnum [256 .. 10000] ++ "\n") appended.

Code:

    main = A.readFile "big.txt" >>= print . B.length

where (A,B) is a suitable combination of
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude

Times:

    Data.ByteString[.Lazy]:     0.00s
    Data.ByteString.UTF8:       0.14s
    Prelude:                    0.21s
    Data.ByteString.Lazy.UTF8:  0.56s
    Data.Text:                  0.66s

Of course Data.ByteString didn't count characters but bytes, so for the modified file, those printed larger numbers than the others (well, it's BYTEString, isn't it?).

It's a little unfair, though, as the ByteString[.Lazy] variants don't need to look at each individual byte, so I also let them and Prelude.String count newlines to see how fast they can inspect each character/byte:

    BS[.Lazy]:  0.02s
    Prelude:    0.23s

Both take 0.02s to inspect each item.

To summarise:
* ByteString-IO is blazingly fast, since all it has to do is get a sequence of bytes from disk into memory.
* [Char]-IO is much slower because it has to transform the sequence of bytes to individual characters as they come.
* counting utf-8 encoded characters in a ByteString is - unsurprisingly - slow. I'm a bit surprised *how* slow it is for lazy ByteStrings. (Caveat: I've no idea whether Data.ByteString.UTF8 would suffer from more multi-byte characters to the point where String becomes faster. My guess is no, not for single traversal. For multiple traversal, String has to identify each individual character only once, while BS.UTF8 must do it each time, so then String may be faster.)
* Data.Text isn't very fast for that one.
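To make the bytes-versus-characters point concrete, here is a small sketch. It is not the code used for the timings above: it decodes via Data.Text.Encoding from the text package rather than using Data.ByteString.UTF8, and the names byteCount, charCount and sample are invented for the example. decodeUtf8 throws on input that is not valid UTF-8, so this assumes well-formed data.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes: what plain Data.ByteString.length reports.
byteCount :: B.ByteString -> Int
byteCount = B.length

-- Number of characters, assuming the bytes are valid UTF-8.
charCount :: B.ByteString -> Int
charCount = T.length . TE.decodeUtf8

-- Illustrative input: "naive" with U+00EF (i with diaeresis),
-- which takes two bytes in UTF-8.
sample :: B.ByteString
sample = TE.encodeUtf8 (T.pack "na\239ve")
```

Here byteCount sample is 6 while charCount sample is 5, the same kind of discrepancy the modified file showed.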
- It wasn't stated what the typical case is
Aren't there several quite different typical cases? One fairly typical case is big ASCII or latin1 files (e.g. fasta files, numerical data). For those, usually ByteString is by far the best choice. Another fairly typical case is *text* processing, possibly with text in different scripts (latin, hebrew, kanji, ...). Depending on what you want to do (and the encoding), any of Prelude.String, Data.Text and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings probably aren't. String and Text also have the advantage that you aren't tied to utf-8. Choose your datatype according to your problem, not one size fits all.
- What counts as a 'significant' difference
Depends of course. For a task performed once, who cares whether it takes one second or three? One hour or three, however, is a significant difference (assuming approximately equal times to write the code). Sometimes 10% difference in performance is important, sometimes a factor of 10 isn't. The point is that you should be aware of the performance differences when making your choice.
Regards
Cheers, Daniel