The question of ByteString

Somewhat related to the discussions about Haskell's performance... String. ByteString. Do we really need both? Can one replace the other? Why is one faster? Can't we make *all* lists this fast? [insert further variations here] Thoughts?

On 11/2/07, Andrew Coppin wrote:
Somewhat related to the discussions about Haskell's performance...
String. ByteString. Do we really need both? Can one replace the other?
You can't get rid of "String" because a String is just a [Char]. Requiring the element type of a list to be "anything except Char" would be silly. In addition, it's useful to have a String that you can apply arbitrary list operations to when performance isn't a concern (i.e., most of the time). Finally, removing String would break existing code.
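For concreteness, String really is nothing more than the Prelude synonym for [Char], so any ordinary list function applies to it directly:

    import Data.Char (toUpper)

    -- String is just a type synonym for [Char], so plain list functions work on it.
    greeting :: String            -- i.e. [Char]
    greeting = "hello"

    main :: IO ()
    main = do
      print (map toUpper greeting)  -- "HELLO"
      print (reverse greeting)      -- "olleh"
      print (length greeting)       -- 5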
Why is one faster? Can't we make *all* lists this fast? [insert further variations here]
ByteString takes advantage of the fact that the elements are, well, bytes. The operations are optimized for reading large amounts of text, but not necessarily for other applications. Lists are a parameterized type, so the elements of a list are pointers to arbitrary data. So that's why the same tricks as ByteString don't apply to general lists. That isn't to say that there aren't possible optimizations which haven't yet been dreamed of.

Cheers,
Tim

--
Tim Chevalier * catamorphism.org * Often in error, never in doubt
"It is a defect of God's humour that he directs our hearts everywhere but to those who have a right to them." -- Tom Stoppard
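For concreteness, a small sketch of that contrast using the standard Data.ByteString.Char8 interface (which views the bytes as 8-bit characters):

    import qualified Data.ByteString.Char8 as B
    import Data.Char (toUpper)

    -- The String is a linked list of boxed Chars; the ByteString holds the same
    -- characters as a flat, unboxed byte buffer (one byte per Char, so only
    -- Latin-1/ASCII survives the round trip faithfully).
    main :: IO ()
    main = do
      let s  = "hello, bytestring"
          bs = B.pack s
      print (length s, B.length bs)                         -- (17,17)
      print (map toUpper s == B.unpack (B.map toUpper bs))  -- True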

Tim Chevalier wrote:
On 11/2/07, Andrew Coppin wrote:
Somewhat related to the discussions about Haskell's performance...
String. ByteString. Do we really need both? Can one replace the other?
You can't get rid of "String" because a String is just a [Char]. Requiring the element type of a list to be "anything except Char" would be silly. In addition, it's useful to have a String that you can apply arbitrary list operations to when performance isn't a concern (i.e., most of the time). Finally, removing String would break existing code.
Why is one faster? Can't we make *all* lists this fast? [insert further variations here]
ByteString takes advantage of the fact that the elements are, well, bytes. The operations are optimized for reading large amounts of text, but not necessarily for other applications. Lists are a parameterized type, so the elements of a list are pointers to arbitrary data. So that's why the same tricks as ByteString don't apply to general lists. That isn't to say that there aren't possible optimizations which haven't yet been dreamed of.
Well OK, maybe I was a little vague. Let me be a bit more specific...

If you do text processing using ByteString rather than String, you get dramatically better performance in time and space. For me, this raises a number of questions:

1. Why do I have to type "ByteString" in my code? Why isn't the compiler automatically performing this optimisation for me? (I.e., is there some observable property that is changed? Currently the answer is yes: the ByteString interface only provides "truncated" Unicode characters. But, in principle, that could be changed.)

2. ByteString makes text strings faster. But what about other kinds of collections? Can't we do something similar to them that makes them go faster?

As I understand it, ByteString is faster due to several factors. First of all, it's stricter. Secondly, it's an "unboxed" structure (so you eliminate layers of indirection and there's less GC load). Third, it's implemented as an array that looks like a linked list.

Given how ubiquitous lists are in Haskell, "array that looks like a linked list" sounds like one seriously useful data type! Yet ByteString seems to be the only implementation of this concept - and only for lists of unboxed bytes. (Not even unboxed Word16 or anything else, *only* Word8.) If I understand this correctly, a ByteString is actually a linked list of large array chunks. (This presumably yields fastER random access than a plain linked list?) Also, it seems to be possible to create a new array which is merely a subrange of an existing one, without any copying; the standard array API doesn't seem to provide this, yet it sounds damn useful.

These are the things I'm thinking about. Is there some deep theoretical reason why things are the way they are? Or is it merely that nobody has yet had time to make something better? ByteString solves the problem of text strings (and raw binary data) very nicely, it's just a pity we can't apply some of that know-how more widely...
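For reference, the zero-copy subrange behaviour is already part of the ByteString API itself: take and drop on a strict ByteString just adjust an offset and a length over the shared buffer. A minimal sketch:

    import qualified Data.ByteString as B

    -- take and drop on a strict ByteString only adjust an offset and length
    -- against the same underlying buffer, so no bytes are copied.
    -- (A lazy ByteString is, roughly, a linked list of such strict chunks.)
    middle :: B.ByteString -> B.ByteString
    middle = B.take 10 . B.drop 5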

On 11/2/07, Andrew Coppin wrote:
1. Why do I have to type "ByteString" in my code? Why isn't the compiler automatically performing this optimisation for me? (I.e., is there some observable property that is changed? Currently the answer is yes: the ByteString interface only provides "truncated" Unicode characters. But, in principle, that could be changed.)
That's an interesting question; one property the compiler would have to prove in order to replace a String with a ByteString is that the given String will be fully evaluated strictly (e.g., not just that it will be evaluated to WHNF, but that each of its elements will be demanded). Currently, GHC's demand analyzer doesn't look inside lists (it only looks inside things that have product types), but it's not unimaginable. It would be a matter of whether the potential payoff justifies the complexity of the imagined analysis, as always.
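As a tiny made-up example of the property such an analysis would have to establish (the name firstChar is purely illustrative):

    -- This function demands only the first cons cell of its argument, so
    -- replacing the lazy String with a fully strict ByteString would change
    -- its meaning: the lazy version prints 'x', a strict one would have to
    -- force the undefined tail first.
    firstChar :: String -> Char
    firstChar (c:_) = c

    main :: IO ()
    main = print (firstChar ('x' : undefined))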
2. ByteString makes text strings faster. But what about other kinds of collections? Can't we do something similar to them that makes them go faster?
As I understand it, ByteString is faster due to several factors. First of all, it's stricter. Secondly, it's an "unboxed" structure (so you eliminate layers of indirection and there's less GC load). Third, it's implemented as an array that looks like a linked list. Given how ubiquitous lists are in Haskell, "array that looks like a linked list" sounds like one seriously useful data type! Yet ByteString seems to be the only implementation of this concept - and only for lists of unboxed bytes. (Not even unboxed Word16 or anything else, *only* Word8.) If I understand this correctly, a ByteString is actually a linked list of large array chunks. (This presumably yields fastER random access than a plain linked list?) Also, it seems to be possible to create a new array which is merely a subrange of an existing one, without any copying; the standard array API doesn't seem to provide this, yet it sounds damn useful.
These are the things I'm thinking about. Is there some deep theoretical reason why things are the way they are? Or is it merely that nobody has yet had time to make something better? ByteString solves the problem of text strings (and raw binary data) very nicely, it's just a pity we can't apply some of that know-how more widely...
I don't think there's a deep theoretical reason why this doesn't exist, but I also don't think it's necessarily *just* a matter of no one having had time yet. As always, there are trade-offs involved, and people try to avoid introducing *too* many special cases into the compiler.

Cheers,
Tim

--
Tim Chevalier * catamorphism.org * Often in error, never in doubt
"I give fellowship advice, not relationship advice." -- Michael Sacramento

Tim Chevalier wrote:
I don't think there's a deep theoretical reason why this doesn't exist, but I also don't think it's necessarily *just* a matter of no one having had time yet. As always, there are trade-offs involved, and people try to avoid introducing *too* many special cases into the compiler.
Well, that sounds like a reasonable answer. ByteString is a special case. I'm just wondering how much of it is really specific to the string / binary processing case, and how much of it will generalise to other useful places. ;-) (Mathematicians like to ask "interesting" questions. Unfortunately, at least in mathematics, "interesting" tends to correlate with "this may take several human lifetimes to solve"...)

Andrew Coppin wrote:
1. Why do I have to type "ByteString" in my code? Why isn't the compiler automatically performing this optimisation for me?
One reason is that ByteString is stricter than String. Even lazy ByteString operates on 64KB chunks. You can see how this might lead to problems with a String like this:

    "foo" ++ undefined

The first three elements of this list are well-defined, but if you touch the fourth, you die.
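A runnable sketch of that gap (the module is the standard lazy Char8 interface; partial is just a made-up binding):

    import qualified Data.ByteString.Lazy.Char8 as BL

    -- With a plain String only the demanded cells are forced; even a lazy
    -- ByteString fills in whole chunks at a time.
    partial :: String
    partial = "foo" ++ undefined

    main :: IO ()
    main = do
      print (take 3 partial)                  -- "foo": only three cells are forced
      -- print (BL.take 3 (BL.pack partial))  -- would hit the undefined while
      --                                      -- building the first chunk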
2. ByteString makes text strings faster. But what about other kinds of collections? Can't we do something similar to them that makes them go faster?
Not as easily. The big wins with ByteString are, as you observe, that the data are tiny, uniformly sized, and easily unboxed (though using ForeignPtr seems to be a significant win compared to UArray, too). This also applies to other basic types like Int and Double, but leave those behind, and you get problems. If your type is an instance of Storable, it's going to have a uniform size, but it might be expensive to flatten and unflatten it, so who knows whether or not it's truly beneficial. If it's not an instance of Storable, you have to store an array of boxed values, and we know that arrays of boxes have crummy locality of reference.

Spencer Janssen hacked up the ByteString code to produce StorableVector as part of last year's SoC, but it never got finished off: http://darcs.haskell.org/SoC/fps-soc/Data/StorableVector/

More recently, we've been pinning our hopes on the new list fusion stuff to give many of the locality-of-reference benefits of StorableVector with fewer restrictions, and all the heavy work done in a library.
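For the other basic types, the flat-array trick is already available today through unboxed arrays; a minimal example with Double and the standard Data.Array.Unboxed API:

    import Data.Array.Unboxed (UArray, listArray, (!))

    -- An unboxed array of Doubles: the elements are stored flat, with no
    -- per-element pointer or thunk, much as ByteString does for Word8.
    samples :: UArray Int Double
    samples = listArray (0, 4) [1.0, 2.0, 3.0, 4.0, 5.0]

    main :: IO ()
    main = print (samples ! 2 + samples ! 4)   -- 8.0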

On Nov 2, 2007, at 17:35, Andrew Coppin wrote:
These are the things I'm thinking about. Is there some deep theoretical reason why things are the way they are? Or is it merely that nobody has yet had time to make something better? ByteString solves the problem of text strings (and raw binary data) very nicely, it's just a pity we can't apply some of that know-how more widely...
I'm under the impression that several of the things that make ByteString faster (e.g. smarter fusion) are either implemented within (newer) GHC already, such that other list-like types can take advantage of them directly, or being worked on as part of a new Data.Stream module that anyone can use with their own types instead of simple lists. So in the long run it *won't* just be ByteString; that's just what's driving the development.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
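For reference, the core of that stream-fusion idea fits in a few lines. This is a paraphrase of the types from the Coutts/Leshchinskiy/Stewart paper, not the shipped Data.Stream code:

    {-# LANGUAGE ExistentialQuantification #-}

    -- A producer is an explicit state machine: a step function plus a seed.
    data Step s a = Done | Skip s | Yield a s
    data Stream a = forall s. Stream (s -> Step s a) s

    stream :: [a] -> Stream a
    stream xs0 = Stream next xs0
      where
        next []     = Done
        next (x:xs) = Yield x xs

    unstream :: Stream a -> [a]
    unstream (Stream next s0) = go s0
      where
        go s = case next s of
          Done       -> []
          Skip s'    -> go s'
          Yield x s' -> x : go s'

    -- Non-recursive definitions like this one are what allow whole pipelines
    -- of list functions to fuse into a single loop.
    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream next s0) = Stream next' s0
      where
        next' s = case next s of
          Done       -> Done
          Skip s'    -> Skip s'
          Yield x s' -> Yield (f x) s'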

On Fri, 2007-11-02 at 21:35 +0000, Andrew Coppin wrote:
Well OK, maybe I was a little vague. Let me be a bit more specific...
If you do text processing using ByteString rather than String, you get dramatically better performance in time and space. For me, this raises a number of questions:
1. Why do I have to type "ByteString" in my code? Why isn't the compiler automatically performing this optimisation for me? (I.e., is there some observable property that is changed?
Yes, the semantics are different. ByteString is stricter.

In some circumstances you could discover that some list is being used sufficiently strictly (spine and element strict) that you could do a representation change to use strict arrays. It is something I have pondered occasionally, and I think it is an interesting avenue for research.

One approach might be to do a more sophisticated strictness analysis earlier in the compilation process, one that gives details on the strictness of substructure, i.e. the tail/element strictness in lists. If that strictness information were then available to the rule matching, we might be able to write rules that change certain functions to work on optimised data representations.

However, this is likely to be quite fragile. I usually think that it's better to declare the strictness you want up front in one place, and have that be propagated, rather than doing the reverse: inferring that something could be stricter from all the use sites. Strictness annotations on data constructors are a good example of this.
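A small example of that last point (the Point type here is made up purely for illustration):

    -- The bangs force both fields whenever a Point is built, so the strictness
    -- is declared once at the definition site rather than inferred from every
    -- use site.  With the UNPACK pragma (or -funbox-strict-fields) GHC can also
    -- unbox the fields into the constructor, removing a layer of indirection.
    data Point = Point {-# UNPACK #-} !Int {-# UNPACK #-} !Int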
Currently the answer is yes: the ByteString interface only provides "truncated" Unicode characters. But, in principle, that could be changed.)
Indeed it could; we could provide a proper Unicode string type.
2. ByteString makes text strings faster. But what about other kinds of collections? Can't we do something similar to them that makes them go faster?
There is much less benefit for other collections since the overheads of generic structures are smaller for other types. Note that the NDP parallel arrays stuff uses type functions to calculate optimised data representations for arrays of various element types.
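Roughly, the idea is a type-level function from element type to array representation. A hand-waving sketch (the name Arr is made up here; this is not the actual NDP code):

    {-# LANGUAGE TypeFamilies #-}

    import qualified Data.ByteString as BS
    import Data.Word (Word8)

    -- A type function picks an optimised representation per element type:
    type family Arr a
    type instance Arr Word8  = BS.ByteString    -- flat, unboxed bytes
    type instance Arr (a, b) = (Arr a, Arr b)   -- an array of pairs as a pair of arrays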
As I understand it, ByteString is faster due to several factors. First of all, it's stricter.
So that's the semantic difference.
Secondly, it's an "unboxed" structure (so you eliminate layers of indirection and there's less GC load).
Which is the representation optimisation allowed by the semantic change of making it stricter.
Third, it's implemented as an array that looks like a linked list. Given how ubiquitous lists are in Haskell, "array that looks like a linked list" sounds like one seriously useful data type! Yet ByteString seems to be the only implementation of this concept - and only for lists of unboxed bytes. (Not even unboxed Word16 or anything else, *only* Word8.) If I understand this correctly, a ByteString is actually a linked list of large array chunks. (This presumably yields fastER random access than a plain linked list?) Also, it seems to be possible to create a new array which is merely a subrange of an existing one, without any copying; the standard array API doesn't seem to provide this, yet it sounds damn useful.
I think the NDP project should get us most of this stuff actually.

Duncan

Duncan Coutts wrote:
On Fri, 2007-11-02 at 21:35 +0000, Andrew Coppin wrote:
Well OK, maybe I was a little vague. Let me be a bit more specific...
If you do text processing using ByteString rather than String, you get dramatically better performance in time and space. For me, this raises a number of questions:
1. Why do I have to type "ByteString" in my code? Why isn't the compiler automatically performing this optimisation for me? (I.e., is there some observable property that is changed?
Yes, the semantics are different. ByteString is stricter. In some circumstances you could discover that some list is being used sufficiently strictly (spine and element strict) that you could do a representation change to use strict arrays. It is something I have pondered occasionally and I think that is an interesting avenue for research.
OK. So I'm not the first person to wonder about this then... ;-)
One approach might be to do a more sophisticated strictness analysis
However this is likely to be quite fragile. I usually think that it's better to declare the strictness you want up front in one place
Yeah, I can imagine. And I guess then you'd be forever wondering "hey, did the compiler do that optimisation or not?" like with certain current optimisations now. (E.g., did that type get unboxed?)
Currently the answer is yes: the ByteString interface only provides "truncated" Unicode characters. But, in principle, that could be changed.)
Indeed it could, we could provide a proper Unicode string type.
Likely to happen in the near term? (Not that I imagine this is a huge issue for anybody...)
2. ByteString makes text strings faster. But what about other kinds of collections? Can't we do something similar to them that makes them go faster?
There is much less benefit for other collections since the overheads of generic structures are smaller for other types.
Note that the NDP parallel arrays stuff uses type functions to calculate optimised data representations for arrays of types.
Type... functions...? That sounds pretty scary. :-} I was thinking more immediately about lists in general, but also perhaps binary trees and things. (Although it's probably hard to optimise a general binary tree; you would probably optimise a tree designed for a specific *purpose* instead...)
As I understand it, ByteString is faster due to several factors. First of all, it's stricter.
So that's the semantic difference.
Secondly, it's an "unboxed" structure.
Which is the representation optimisation allowed by the semantic change of making it stricter.
Indeed.
Third, it's implemented as an array that looks like a linked list. Given how ubiquitous lists are in Haskell, "array that looks like a linked list" sounds like one seriously useful data type! Yet ByteString seems to be the only implementation of this concept - and only for lists of unboxed bytes. (Not even unboxed Word16 or anything else, *only* Word8.) If I understand this correctly, a ByteString is actually a linked list of large array chunks. (This presumably yields fastER random access than a plain linked list?) Also, it seems to be possible to create a new array which is merely a subrange of an existing one, without any copying; the standard array API doesn't seem to provide this, yet it sounds damn useful.
I think the NDP project should get us most of this stuff actually.
Cool. So in summary, these kinds of things are "on the way", they're just not here quite yet? (BTW, anybody have any clue what's happening with stream fusion? I remember reading the paper saying "hey, this is how it works and it's cool and we're going to try to replace the whole list library with new stream implementations", but that's the last I heard...)

Duncan Coutts wrote:
Yes, the semantics are different. ByteString is stricter. In some circumstances you could discover that some list is being used sufficiently strictly (spine and element strict) that you could do a representation change to use strict arrays. It is something I have pondered occasionally and I think that is an interesting avenue for research.
Also important is the fact that String = [Char], and a Char can hold any Unicode character, whereas a ByteString is a sequence of Word8 elements (i.e. integers from 0-255). If you want to store a Char in a ByteString then you have to convert it to UTF-8 or something similar. The Encoding package (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.2) has the functions for this kind of thing.

Paul.
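A small demonstration of that truncation, using the standard Data.ByteString.Char8 interface (the snowman character is just an arbitrary example):

    import qualified Data.ByteString.Char8 as B
    import Data.Char (ord)

    -- Char8.pack keeps only the low 8 bits of each Char, so anything outside
    -- Latin-1 is silently mangled; real text needs an encoding step instead.
    main :: IO ()
    main = do
      let snowman = '\x2603'               -- U+2603, well outside 0..255
      print (ord snowman)                  -- 9731
      print (B.unpack (B.pack [snowman]))  -- "\ETX": only the low byte survives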

On Nov 3, 2007, at 5:34, Andrew Coppin wrote:
(BTW, anybody have any clue what's happening with stream fusion? I remember reading the paper saying "hey, this is how it works and it's cool and we're going to try to replace the whole list library with new stream implementations", but that's the last I heard...)
dons and I have both mentioned Data.Stream recently; I think it's on hackage. It's still a work in progress though, I believe.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
On Nov 3, 2007, at 5:34 , Andrew Coppin wrote:
(BTW, anybody have any clue what's happening with stream fusion? I remember reading the paper saying "hey, this is how it works and it's cool and we're going to try to replace the whole list library with new stream implementations", but that's the last I heard...)
dons and I have both mentioned Data.Stream recently; I think it's on hackage. It's still a work in progress though, I believe.
ITYM Data.List.Stream, which is at http://www.cse.unsw.edu.au/~dons/streams.html
Data.Stream, on hackage, is a package for infinite lists, i.e. streams.

Jules
participants (7)
- Andrew Coppin
- Brandon S. Allbery KF8NH
- Bryan O'Sullivan
- Duncan Coutts
- Jules Bean
- Paul Johnson
- Tim Chevalier