Bytestrings vs String?

Marc Weber

2 Feb 2009

10:45 p.m.

A lot of people suggest using ByteStrings for performance, strictness, or whatever other reasons. However, how well do they talk to other libraries? One I have in mind right now is hslogger.

Should hslogger be implemented using Strings or ByteStrings? Should there be two versions, hslogger-bytestring and hslogger-string? Or would it be better to implement one String class which can cope with everything (performance will drop, won't it)?

I feel it would make sense to talk about how to provide this. In the future I'd like to explore using Haskell for web development, so speed does matter, and I don't want my server to convert from ByteStrings to Strings and back multiple times.

So is the best we could do to compile the same library twice using different flags, one providing a ByteString API, the other using Strings? Cluttering up code with conversions to and from ByteString doesn't look compelling to me.

Thoughts?

Marc Weber


John Goerzen

2 Feb

10:58 p.m.

Marc Weber wrote:

...

A lot of people are suggesting using Bytestrings for performance, strictness, or whatever other reasons.

However how well do they talk to other libraries?

One I've in mind is hslogger right now.

Should hslogger be implemented using Strings or Bytestrings ?

Should there be two versions?

hslogger-bytestring and hslogger-string?

Or would it be better to implement one String class which can cope with everything (performance will drop, won't it?)

Not necessarily. hslogger could easily accept both, and deal with it appropriately under the hood. If whatever it prefers under the hood is supplied to it, I don't see any reason that performance would suffer. I very much suspect though that you would not likely see a measurable performance difference either way with hslogger in real-world situations. -- John
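A rough sketch of what "accepting both" might look like: a small conversion class with instances for String and strict ByteString, so one function can take either. The names LogMessage, toLogBytes and formatLine are made up for illustration and are not part of hslogger; the String instance assumes Char8 packing (code points below 256).

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical class: anything that can be turned into log bytes.
class LogMessage a where
  toLogBytes :: a -> B.ByteString

-- ByteString input is passed through untouched, so there is no conversion.
instance LogMessage B.ByteString where
  toLogBytes = id

-- String input is packed once at the API boundary.
-- Assumption: Char8 packing, so only code points below 256 survive.
instance LogMessage String where
  toLogBytes = BC.pack

-- Hypothetical formatter: prefix the message with a logger name.
formatLine :: LogMessage a => String -> a -> B.ByteString
formatLine name msg = B.concat [BC.pack name, BC.pack ": ", toLogBytes msg]
```

A caller who already holds a ByteString pays nothing extra here, while a caller with a String literal pays for one pack per message.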

wren ng thornton

3 Feb

3:41 a.m.

Marc Weber wrote:

...

A lot of people are suggesting using Bytestrings for performance, strictness, or whatever other reasons.

However how well do they talk to other libraries?

I'm not sure what you mean. For passing them around: if someone's trying to combine your library (the version using ByteStrings) with another Haskell library that uses ByteStrings, then everything works fine, assuming both libraries are compiled against the same version of the bytestring library. As I recall, ByteStrings are designed to ease passing to C code across the FFI too, in case someone wants to use your library with some FFI C code.

If someone's trying to combine your library with another library that uses String, they'll need to add conversions. (All of this is symmetric for a version of your library using String with another library using ByteStrings.)

The big compatibility issue I can see is the question of what a given ByteString *means*. In particular, via the Data.ByteString.Char8 module it represents only 8-bit characters (each Char is truncated to its low 8 bits), not all of Unicode like [Char] does. There are libraries for lossless encoding of [Char] into ByteStrings, but in general there can be encoding mismatch problems if, say, your library uses UTF8-encoded ByteStrings but the other library treats them like Char8-encoded (or UTF16BE, UTF16LE, FooBar,...), potentially mangling or hallucinating multi-byte characters.

In general, if you're concerned about performance (or believe your users will be) then ByteStrings are a good bet. Just make it clear in the documentation what sort of encoding you use (or whether your library is encoding agnostic).

For hslogger specifically, it looks like most of the Strings are arguments which will typically be written as literals. Thus, to minimize boilerplate, if you do switch to ByteStrings then you may want to provide a module that does all the String->ByteString conversions for the user.

If you have a good program for testing real-world use of hslogger, before committing to the change I'd suggest benchmarking (in time and in space) the differences between the current String implementation and a proposed ByteString implementation.
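To make the encoding mismatch concrete, here is a small sketch that packs the same String two ways and then misreads the UTF-8 bytes as Char8. It assumes a current version of the bytestring package (for Data.ByteString.Builder); the helper names char8Bytes, utf8Bytes and misreadAsChar8 are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (stringUtf8, toLazyByteString)
import Data.Word (Word8)

-- Char8 packing keeps only the low 8 bits of each Char.
char8Bytes :: String -> [Word8]
char8Bytes = B.unpack . BC.pack

-- UTF-8 encoding via the bytestring Builder; non-ASCII becomes multi-byte.
utf8Bytes :: String -> [Word8]
utf8Bytes = BL.unpack . toLazyByteString . stringUtf8

-- Reading UTF-8 bytes back as if they were Char8 yields one Char per byte.
misreadAsChar8 :: String -> String
misreadAsChar8 = BC.unpack . B.pack . utf8Bytes
```

For the single character U+03BB (Greek small lambda), char8Bytes gives the one byte 0xBB, utf8Bytes gives the two bytes 0xCE 0xBB, and misreadAsChar8 returns a two-character string, which is the kind of mangling described above.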

...

Should there be two versions?

hslogger-bytestring and hslogger-string?

I'd just stick with one (with a module for hiding the conversions, as desired). Duplicating the code introduces too much room for maintenance and compatibility issues.
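One way to picture the "module for hiding the conversions": a ByteString core function plus a thin String front end that packs its arguments. Both function names are hypothetical and not hslogger's API; in a real package they would probably live in two separate modules, and the packing here assumes Char8 (code points below 256).

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical core API working on ByteStrings only.
formatEntryBS :: B.ByteString -> B.ByteString -> B.ByteString
formatEntryBS logger msg = B.concat [BC.pack "[", logger, BC.pack "] ", msg]

-- Hypothetical convenience wrapper for String callers (typically literals).
-- Assumption: Char8 packing, i.e. characters below code point 256.
formatEntry :: String -> String -> B.ByteString
formatEntry logger msg = formatEntryBS (BC.pack logger) (BC.pack msg)
```

There is only one implementation to maintain; the wrapper adds no logic of its own beyond the conversion.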

...

Or would it be better to implement one String class which can cope with everything (performance will drop, won't it?)

It'd be a very large class if you do it generally[1], and large classes like that are generally frowned on (for good or ill). If you only need a small subset of string operations then it may be more feasible to have a smaller class with only those operations.

[1] See everything hidden from the Prelude in http://hackage.haskell.org/packages/archive/list-extras/0.2.2.1/doc/html/src... or see what all is offered by Data.ByteString vs the Prelude.
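A minimal sketch of the "smaller class" idea, assuming the code only needs four operations. The class MiniString and its methods are invented for this example; the ByteString instance assumes Char8 semantics (one byte per character).

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical minimal class: only the operations this code needs.
class MiniString s where
  fromStr :: String -> s
  toStr   :: s -> String
  append  :: s -> s -> s
  len     :: s -> Int

instance MiniString String where
  fromStr = id
  toStr   = id
  append  = (++)
  len     = length

-- Assumption: Char8 semantics (one byte per character).
instance MiniString B.ByteString where
  fromStr = BC.pack
  toStr   = BC.unpack
  append  = B.append
  len     = B.length

-- Generic code written once against the small class.
bracketed :: MiniString s => s -> s
bracketed s = fromStr "<" `append` s `append` fromStr ">"
```

Every further operation the library needs means another method and another pair of instance definitions, which is how such a class grows toward the size of Data.ByteString's export list.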

...

In the future I'd like to explore using haskell for web development. So speed does matter. And I don't want my server to convert from Bytestrings to Strings and back multiple times..

That's the big thing. The more people that use ByteStrings the less need there is to convert when combining libraries. That said, ByteStrings aren't a panacea; lists and laziness are very useful. -- Live well, ~wren

Marc Weber

9:50 a.m.

New subject: Bytestrings vs String? parameters within package names?

On Mon, Feb 02, 2009 at 10:41:57PM -0500, wren ng thornton wrote:

...

Marc Weber wrote:

...
Should there be two versions? hslogger-bytestring and hslogger-string?

I'd just stick with one (with a module for hiding the conversions, as desired). Duplicating the code introduces too much room for maintenance and compatibility issues.

That's the big thing. The more people that use ByteStrings the less need there is to convert when combining libraries. That said, ByteStrings aren't a panacea; lists and laziness are very useful.

Hi wren,

In the second paragraph you agree that there will be less conversion when using only one type of strings. You're also right about encoding. About laziness you're partially right: there is also Data.ByteString.Lazy, which is basically a list of (non-lazy) ByteStrings.

...

Duplicating the code introduces too much room for maintenance and compatibility issues.

I didn't mean duplicating the whole library. I was thinking about a cabal flag.

The cabal file:

    flag bytestring
      Default: False
      Description: enable this to use Bytestrings everywhere instead of strings

    [... now libs and executables: ...]
      if flag(bytestring)
        cpp-options: -DUSE_BYTESTRING

An example module:

    {-# LANGUAGE CPP #-}
    module Example where

    #ifdef USE_STRINGS
    import Data.List as S
    #endif
    #ifdef USE_BYTESTRING
    import Data.ByteString as S
    #endif
    #ifdef USE_LAZY_BYTESTRINGS
    import Data.ByteString.Lazy as S
    #endif
    #ifdef USE_UNICODE_BYTESTRING_LIKE_STRINGS
    -- two bytes per char or more?
    -- they can also be lazy such as Strings however one array element can
    -- have more than one byte
    import Data.Vector as S
    #endif

Of course all four modules

    import Data.List as S
    import Data.ByteString as S
    import Data.ByteString.Lazy as S
    import Data.Vector as S

must expose the same API. Of course cluttering up all files with those ifdefs isn't a nice option either. But one could move this selection into the cabal file by depending on one of these (not yet existing) packages:

    string-string
    string-bytestring
    string-utf8-bytestring
    string-bytestring-lazy

Then you could replace one implementation by the other, recompile, and see whether the results differ. Of course we must take care that we can keep laziness if required.

However, using different packages exposing the same API (same modules and same names) will cause trouble if you really have to use both implementations at some time. (I only know that there has been some discussion about how to tell GHC to use a module from a particular package.) So I'd like to propose another way:

    {-# LANGUAGE CPP #-}
    import Data.STRING as S

and tell .cabal to define STRING to represent one of the different string implementations. I think this would be most portable, and you can additionally import other String modules as well.

So for now I think it would be best if you could teach cabal to change names depending on flags:

    Name: hslogger-${STRING_TYPE}
    flag: use_strings
      set STRING_TYPE = String
    flag: use_bytestrings
      set STRING_TYPE = Bytestring
    .....

Don't think about this issue in terms of how it is now or how much effort it would be to rewrite everything. Think about how you'd like to work using Haskell in about a year.

Marc
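For what it is worth, the CPP selection can be tried in a single file today. The sketch below assumes a macro USE_BYTESTRING passed via cpp-options (or ghc -DUSE_BYTESTRING) and sticks to functions that both Data.List and Data.ByteString.Char8 export under the same name; Str, fromStr, joinPath and hasPrefix are illustrative names only.

```haskell
{-# LANGUAGE CPP #-}
-- Compile with -DUSE_BYTESTRING to select the ByteString variant;
-- without the flag the String variant is used.
#ifdef USE_BYTESTRING
import qualified Data.ByteString.Char8 as S
type Str = S.ByteString
-- Conversion at the boundary (Char8 assumption: code points below 256).
fromStr :: String -> Str
fromStr = S.pack
#else
import qualified Data.List as S
type Str = String
fromStr :: String -> Str
fromStr = id
#endif

-- Written once against the shared subset of the two APIs.
joinPath :: [Str] -> Str
joinPath = S.intercalate (fromStr "/")

hasPrefix :: Str -> Str -> Bool
hasPrefix = S.isPrefixOf
```

This only works as long as the code stays inside the overlap of the two modules; anything outside it (construction from literals, for instance) needs a small shim such as fromStr.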

John Goerzen

3:41 p.m.

New subject: Bytestrings vs String? parameters within package names?

Marc Weber wrote:

...

On Mon, Feb 02, 2009 at 10:41:57PM -0500, wren ng thornton wrote:

...
Marc Weber wrote:

...
Should there be two versions? hslogger-bytestring and hslogger-string?

I'd just stick with one (with a module for hiding the conversions, as desired). Duplicating the code introduces too much room for maintenance and compatibility issues.

That's the big thing. The more people that use ByteStrings the less need there is to convert when combining libraries. That said, ByteStrings aren't a panacea; lists and laziness are very useful.

Hi wren,

In the second paragraph you agree that there will be less conversion when using only one type of strings.

Incidentally, I already wrote a library that abstracts the difference between a String and a ByteString: ListLike. I don't think anybody, including me, even uses it now. Turns out that's not all that helpful an abstraction to make ;-) -- John

wren ng thornton

9:12 p.m.

New subject: Bytestrings vs String? parameters within package names?

Marc Weber wrote:

...

wren ng thornton wrote:

...
I'd just stick with one (with a module for hiding the conversions, as desired). Duplicating the code introduces too much room for maintenance and compatibility issues.

That's the big thing. The more people that use ByteStrings the less need there is to convert when combining libraries. That said, ByteStrings aren't a panacea; lists and laziness are very useful.

Hi wren,

In the second paragraph you agree that there will be less conversion when using only one type of strings.

You're also right about encoding. About laziness you're partially right: there is also Data.ByteString.Lazy, which is basically a list of (non-lazy) ByteStrings.

Sure, but lazy bytestrings are still chunk-wise strict. Sometimes even that isn't lazy enough (more often with non-string kinds of lists, granted). -- Live well, ~wren
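A small sketch of the chunk structure, using only functions from the bytestring package; the names twoChunks, chunkSizes and firstThree are made up for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- A lazy ByteString is built from strict chunks.
twoChunks :: BL.ByteString
twoChunks = BL.fromChunks [B.pack [1, 2, 3], B.pack [4, 5]]

-- The chunk boundaries are observable: each element is a strict ByteString.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map B.length . BL.toChunks

-- An ordinary list is lazy per element, so this works on an infinite list.
firstThree :: [Int]
firstThree = take 3 (iterate (+ 1) 1)
```

Laziness in the lazy ByteString operates between chunks: forcing any byte forces the whole strict chunk that holds it, whereas the plain list above is produced one cell at a time.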

Brandon S. Allbery KF8NH

3:46 a.m.

On 2009 Feb 2, at 17:45, Marc Weber wrote:

...

Or would it be better to implement one String class which can cope with everything (performance will drop, won't it?)

There is already an IsString class, although IIRC it deals solely with converting a String to a member of IsString (via fromString). Most of the rest involves invoking functions as S.whatever and then importing some variant of Data.ByteString or Data.List qualified as S.

(The latter is kinda unfortunate, but Data.String was taken to provide the IsString class. Code would look a bit saner if Data.String reexported Data.List for this usage model; note that this model only makes sense with IsString, so it would be a hopefully backward-compatible addition.)

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
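A short sketch of that usage model, assuming GHC's OverloadedStrings extension and the IsString instance that the bytestring package provides for strict ByteString. The value names are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BC
import qualified Data.List as L
import Data.String (fromString)

-- With OverloadedStrings a literal is desugared to fromString "...".
greetingBS :: BC.ByteString
greetingBS = "hello"

greetingS :: String
greetingS = "hello"

-- The same conversion written out explicitly.
explicitBS :: BC.ByteString
explicitBS = fromString "hello"

-- Everything beyond literals goes through qualified calls into each module.
startsBS :: Bool
startsBS = BC.isPrefixOf "he" greetingBS

startsS :: Bool
startsS = L.isPrefixOf "he" greetingS
```

IsString covers only the literal; the operations themselves still come from whichever module is imported under the qualifier.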
