
On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> Hi,
> After working through a few Haskell tutorials, I've come across numerous recommendations to use the Data.ByteString library rather than standard [Char], for reasons of "performance". I'm having trouble swallowing this -- presumably the standard String is default for good reasons. Nothing has answered this question: in what case is it better to use [Char]?
In most cases you need an actual String and it is not time-critical, I believe. A ByteString is... well, a string of bytes, not chars: you have no idea whether they are encoded as UTF-8, UCS-2, ASCII, ISO-8859-1 (or as JPEG ;) ). If you want the next char, you don't know how many bytes you need to read (1? 2? 3? depends on the contents?). String ([Char]) has a defined representation: each element is one Unicode code point. A read/write function might encode or decode it incorrectly (before GHC 6.12, System.IO assumed a fixed single-byte encoding on read, IIRC), but that is the function's error, not the type's.
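To make the bytes-versus-chars point concrete, here is a small sketch. It assumes the bytestring and text packages are available; the names sample, utf8Bytes and latin1Bytes are made up for the illustration. The same four-character String becomes different byte sequences depending on the encoding, and the bytes alone do not say which encoding was used.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Illustrative text: "café", written with an escape for U+00E9.
sample :: String
sample = "caf\233"

-- The same text encoded as UTF-8 (U+00E9 takes two bytes).
utf8Bytes :: B.ByteString
utf8Bytes = TE.encodeUtf8 (T.pack sample)

-- The same text as ISO-8859-1: one byte per character.
-- This manual encoding is only valid because every code point here is < 256.
latin1Bytes :: B.ByteString
latin1Bytes = B.pack (map (fromIntegral . fromEnum) sample)

main :: IO ()
main = do
  print (length sample)          -- number of characters
  print (B.unpack utf8Bytes)     -- bytes under UTF-8
  print (B.unpack latin1Bytes)   -- bytes under ISO-8859-1
  -- Decoding with the wrong assumption either fails or yields garbage:
  print (either (const "not valid UTF-8") T.unpack (TE.decodeUtf8' latin1Bytes))
  print (length (T.unpack (TE.decodeLatin1 utf8Bytes)))
```

The String has 4 characters, the UTF-8 form has 5 bytes and the ISO-8859-1 form has 4, so "length" means different things depending on which layer you are looking at.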
> Could anyone point me to a good resource showing the differences between how [Char] and ByteString are implemented, and giving a good heuristic for me to decide which is better in any one case?
ByteString is a pointer with an offset and a length. A lazy ByteString is a linked list of strict ByteStrings (with the additional condition that none of the inner ByteStrings is empty).

In theory String is [Char], i.e. [a], i.e.

    data [a] = [] | a : [a]

In other words it is a linked list of characters. For long strings that may be inefficient (because of poor cache behaviour, O(n) random access, and the need to check for errors while evaluating further [1]).

I heard somewhere that actual implementations optimize it to arrays when possible (i.e. when it can be detected and does not interfere with non-strict semantics). However, I don't know if that is true.

I *guess* that in most cases the overhead of I/O will be large enough to make the difference insignificant. However:

- If you need an exact byte representation, for example for compression, digital signatures etc., you need ByteString.
- If you need to operate on text rather than bytes, use String or a specialized data structure such as Data.Text & co.
- If you don't care about performance and need ease of use (pattern matching etc.), use String.
- If you have no special requirements, then you can use ByteString.

While some languages (for example C, Python, Ruby) mix text and its representation, I guess that is not always the best way. With this separation, a String is text, while a ByteString is a binary representation of something (which can be text, a picture, compressed data etc.).
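A rough sketch of how these representations feel in practice, assuming the bytestring package; countSpaces and the sample data are made up for the illustration. Note that Data.ByteString.Char8 treats every byte as one Char, so it is only sensible for ASCII or Latin-1 data.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- String is a linked list, so you can pattern match on its constructors.
countSpaces :: String -> Int
countSpaces []         = 0
countSpaces (' ' : cs) = 1 + countSpaces cs
countSpaces (_ : cs)   = countSpaces cs

main :: IO ()
main = do
  let s  = "hello brave new world"
      bs = BC.pack s
  -- Same answer from both; the ByteString version scans a flat buffer.
  print (countSpaces s, BC.count ' ' bs)
  -- Indexing walks the list for String (O(n)) but is O(1) for ByteString.
  print (s !! 6, BC.index bs 6)
  -- take/drop on a strict ByteString only adjust offset and length.
  print (BC.take 5 (BC.drop 6 bs))
  -- A lazy ByteString is a list of strict chunks; empty chunks are dropped.
  print (map BC.unpack (BL.toChunks (BL.fromChunks [BC.pack "ab", BC.empty, BC.pack "cd"])))
```

The pattern-matching version is the "ease of use" case from the list above; the ByteString calls show what you gain from the pointer, offset and length layout.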
> Best,
> James Fisher
Regards

[1] However, the O(n) access time and the error checking are also introduced by decoding a string. So if you need UTF-8 you will still get O(n) access time ;)
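A minimal sketch of why footnote [1] holds, assuming the input is valid UTF-8 (nothing here validates it); isContinuation, charStarts and nthCharOffset are hypothetical helpers, not library functions. To find the n-th character you have to scan from the start, because characters have variable width.

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word8)

-- UTF-8 continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation w = w .&. 0xC0 == 0x80

-- Byte offsets at which a character starts (assumes valid UTF-8).
charStarts :: B.ByteString -> [Int]
charStarts = B.findIndices (not . isContinuation)

-- Byte offset of the n-th character (0-based). This is O(n): the bytes
-- before it must be inspected to know where it begins.
nthCharOffset :: Int -> B.ByteString -> Maybe Int
nthCharOffset n bs
  | n < 0     = Nothing
  | otherwise = case drop n (charStarts bs) of
      (i : _) -> Just i
      []      -> Nothing

main :: IO ()
main = do
  -- "café!" in UTF-8: the é occupies bytes 3 and 4.
  let bs = B.pack [0x63, 0x61, 0x66, 0xC3, 0xA9, 0x21]
  print (B.length bs, length (charStarts bs))  -- bytes vs characters
  print (nthCharOffset 4 bs)                   -- where the '!' starts
```

So a ByteString gives O(1) access to the n-th byte, but not to the n-th character, unless the encoding is fixed-width.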