Motion to unify all the string data types

Frequently when I'm coding in Haskell, the crux of my problem is converting between all the stupid string formats.
You've got String, ByteString, Lazy ByteString, Text, [Word], and on and on... I have to constantly look up how to convert between them, and the OverloadedStrings GHC extension doesn't work, and sometimes ByteString.unpack doesn't work, because it expects [Word8], not [Char]. AAAAAAAAAAAAAAAAAAAH!!!
Haskell is a wonderful playground for experimentation. I've started to notice that many Hackage libraries are simply instances of typeclasses designed a while ago, and their underlying implementations are free to play around with various optimizations... But they ideally all expose the same interface through typeclasses.
Can we do the same with String? Can we pick a good compromise of lazy vs strict, flexible vs fast, and all use the same data structure? My vote is for type String = [Char], but I'm willing to switch to another data structure, just as long as it's consistently used.
-- Cheers,
Andrew Pennebaker www.yellosoft.us
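For reference, a minimal sketch of the conversions complained about above, assuming the text and bytestring packages and UTF-8 as the byte encoding. The function names are made up for illustration; only the qualified library calls are real. It also shows why ByteString.unpack does not line up with [Char]: it yields raw bytes.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import Data.Word (Word8)

-- String <-> Text: both are sequences of Unicode code points.
stringToText :: String -> T.Text
stringToText = T.pack

textToString :: T.Text -> String
textToString = T.unpack

-- String -> bytes needs an encoding; UTF-8 is an assumption here.
stringToBytes :: String -> B.ByteString
stringToBytes = TE.encodeUtf8 . T.pack

-- B.unpack gives [Word8], not [Char]: a ByteString holds bytes, not characters.
rawBytes :: B.ByteString -> [Word8]
rawBytes = B.unpack

-- Strict <-> lazy variants of the same type.
strictToLazy :: B.ByteString -> BL.ByteString
strictToLazy = BL.fromStrict

lazyToStrict :: BL.ByteString -> B.ByteString
lazyToStrict = BL.toStrict

textStrictToLazy :: T.Text -> TL.Text
textStrictToLazy = TL.fromStrict
```

In GHCi, `rawBytes (stringToBytes "hi")` should give the two byte values of the ASCII letters, which makes the [Word8] versus [Char] distinction concrete.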

Hi Andrew,
On Fri, Nov 9, 2012 at 6:15 PM, Andrew Pennebaker <andrew.pennebaker@gmail.com> wrote:
Frequently when I'm coding in Haskell, the crux of my problem is converting between all the stupid string formats.
You've got String, ByteString, Lazy ByteString, Text, [Word], and on and on... I have to constantly look up how to convert between them, and the OverloadedStrings GHC extension doesn't work, and sometimes ByteString.unpack doesn't work, because it expects [Word8], not [Char]. AAAAAAAAAAAAAAAAAAAH!!!
Haskell is a wonderful playground for experimentation. I've started to notice that many Hackage libraries are simply instances of typeclasses designed a while ago, and their underlying implementations are free to play around with various optimizations... But they ideally all expose the same interface through typeclasses.
Can we do the same with String? Can we pick a good compromise of lazy vs strict, flexible vs fast, and all use the same data structure? My vote is for type String = [Char], but I'm willing to switch to another data structure, just as long as it's consistently used.
tl;dr: Use strict Text and ByteStrings.
We need at least two string types, one for byte strings and one for Unicode strings, as these are two semantically different concepts. Most modern languages use two types (e.g. str and unicode in Python). For Unicode strings, String is not a good candidate; it's slow, uses a lot of memory, doesn't hide its representation [1], and finally, it encourages people to do the wrong thing from a Unicode perspective [2].
As a community we should primarily use strict ByteStrings and Texts. There are uses for the lazy variants (i.e. they are sometimes more efficient), but in general the strict versions should be preferred. Choosing to use these two types can sometimes be a bit frustrating, as lots of code (e.g. the base package) uses Strings. But if we don't start using them the pain will never end.
One of the main pain points is that the I/O layer uses Strings, which is both inconvenient and wrong (e.g. a socket returns bytes, not Unicode code points, yet the recv function returns a String). We really need to create a more sane I/O layer.
If you use ByteString and Text, you shouldn't see calls to pack/unpack in your code (except if you want to interact with legacy code), as the correct way to go between the two is via the encode and decode functions in the text package (e.g. encodeUtf8 and decodeUtf8 in Data.Text.Encoding).
As for type classes, I don't think we use them enough. Perhaps because Haskell wasn't developed as an engineering language, some good software engineering principles (code against an interface, not a concrete implementation) aren't used in our base libraries. One specific example is the lack of a sequence abstraction/type class that all the string, list, and vector types could implement. Right now all these types try to implement a compatible interface (i.e. the traditional list interface), without a language mechanism to express that this is what they do.
[1] If String had been designed as an abstract type, we could simply have switched its implementation for a more efficient one and we wouldn't have had to create a new Text type.
[2] By having the primary interface of a Unicode data type be a sequence, we encourage users to work on strings element-wise, which can lead to errors as Unicode code points don't correspond well to the human concept of a character (for example, the Swedish ä character can be represented using either one or two code points). A sequence view is sometimes useful, if you're implementing more high-level transformations, but often you should use functions that operate on the whole string, such as toUpper :: Text -> Text.
Cheers, Johan
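The two points above (decoding can fail, and code points are not characters) can be sketched as follows. This assumes the text and bytestring packages; `fromBytes`, `precomposed` and `decomposed` are illustrative names, not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Bytes -> Text can fail on invalid input, so use the total variant
-- decodeUtf8' rather than pack/unpack. UTF-8 is assumed.
fromBytes :: B.ByteString -> Either UnicodeException T.Text
fromBytes = TE.decodeUtf8'

-- The Swedish ä written as one code point (U+00E4) ...
precomposed :: T.Text
precomposed = T.pack "\x00E4"

-- ... and as two code points ('a' followed by combining diaeresis U+0308).
decomposed :: T.Text
decomposed = T.pack "a\x0308"

-- T.length counts code points, not user-perceived characters:
--   T.length precomposed == 1
--   T.length decomposed  == 2
-- Whole-string functions such as T.toUpper handle both spellings.
shout :: T.Text -> T.Text
shout = T.toUpper
```

Both values render as the same character, yet element-wise code sees different lengths, which is the kind of error footnote [2] warns about.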

* Johan Tibell
As a community we should primarily use strict ByteStrings and Texts. There are uses for the lazy variants (i.e. they are sometimes more efficient), but in general the strict versions should be preferred.
I'm fairly surprised by this advice. I think that lazy BS/Text are a much safer default. If there's not much text it wouldn't matter anyway, but for large amounts using strict BS/Text would disable incremental producing/consuming (except when you're using some kind of an iteratee library). Can you explain your reasoning? Roman

On Fri, Nov 9, 2012 at 10:22 PM, Roman Cheplyaka
* Johan Tibell [2012-11-09 19:00:04-0800]
As a community we should primarily use strict ByteStrings and Texts. There are uses for the lazy variants (i.e. they are sometimes more efficient), but in general the strict versions should be preferred.
I'm fairly surprised by this advice.
I think that lazy BS/Text are a much safer default.
If there's not much text it wouldn't matter anyway, but for large amounts using strict BS/Text would disable incremental producing/consuming (except when you're using some kind of an iteratee library).
Can you explain your reasoning?
It better communicates intent. A lazy byte string, for example, can be used for two separate things:
* to model a stream of bytes, or
* to avoid costs due to concatenating strings.
By using a strict byte string you make it clear that you're not trying to do the former (at some potential cost due to the latter). When you want to do the former, it should be clear to the consumer that they had better consume the string incrementally, so as to preserve laziness and avoid the space leak that forcing the whole string would cause.
-- Johan
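A small sketch of the "stream of bytes" use, assuming the bytestring package. The names `endless`, `firstFive` and `chunked` are illustrative. A lazy ByteString can describe a stream that never ends, as long as the consumer only forces a prefix; a strict ByteString cannot.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- An unbounded stream of bytes. This only makes sense lazily:
-- forcing all of it (e.g. with BL.length) would never terminate.
endless :: BL.ByteString
endless = BL.cycle (BL.pack [1, 2, 3])

-- Incremental consumption: only the first five bytes are ever produced.
firstFive :: B.ByteString
firstFive = BL.toStrict (BL.take 5 endless)

-- Under the hood a lazy ByteString is a lazy list of strict chunks.
chunked :: BL.ByteString
chunked = BL.fromChunks [B.pack [1, 2], B.pack [3]]
```

Seeing a strict ByteString in a signature rules out the `endless` case up front, which is the intent being communicated.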

On 10 November 2012 17:57, Johan Tibell
It better communicates intent. A lazy byte string, for example, can be used for two separate things:
* to model a stream of bytes, or
* to avoid costs due to concatenating strings.
By using a strict byte string you make it clear that you're not trying to do the former (at some potential cost due to the latter). When you want to do the former, it should be clear to the consumer that they had better consume the string incrementally, so as to preserve laziness and avoid the space leak that forcing the whole string would cause.
Good advice. And when you want to do the latter you should use a Builder [1] (or [2] if you're working with text).
Bas
[1] http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Dat...
[2] http://hackage.haskell.org/packages/archive/text/0.11.2.3/doc/html/Data-Text...
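A minimal sketch of the Builder approach for both bytes and text, assuming bytestring 0.10.2 or later and the text package. `csvLine`, `renderBytes` and `joinWords` are hypothetical helpers written for this example.

```haskell
import Data.List (intersperse)
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB

-- One comma-separated line of decimal numbers, ending in a newline.
-- Appending Builders is cheap; no intermediate strings are copied.
csvLine :: [Int] -> BB.Builder
csvLine xs =
  mconcat (intersperse (BB.char7 ',') (map BB.intDec xs) ++ [BB.char7 '\n'])

-- Run the Builder once at the end to get the bytes.
renderBytes :: [[Int]] -> BL.ByteString
renderBytes = BB.toLazyByteString . foldMap csvLine

-- The same idea for Text, using the text package's Builder.
joinWords :: [T.Text] -> TL.Text
joinWords = TB.toLazyText . mconcat . map TB.fromText
```

The result is a lazy value in both cases, but here laziness is an output format of the Builder rather than the working representation, so the rest of the code can stay on strict types via BL.toStrict or TL.toStrict.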

On Sat, Nov 10, 2012 at 4:00 AM, Johan Tibell
As for type classes, I don't think we use them enough. Perhaps because Haskell wasn't developed as an engineering language, some good software engineering principles (code against an interface, not a concrete implementation) aren't used in our base libraries. One specific example is the lack of a sequence abstraction/type class that all the string, list, and vector types could implement. Right now all these types try to implement a compatible interface (i.e. the traditional list interface), without a language mechanism to express that this is what they do.
I think the challenge is designing an abstraction that everyone is comfortable with. If you just make everything a class method (ListLike), it's ugly. If you don't, how do you figure out what goes in the class and what gets implemented on top of it? Is there any principled reason for it, or is it just ad hoc? How do you make sure that none of the implementations suffers a performance decrease? What about sequential vs. random access (list vs. array) issues? Should an interface be implemented if it's semantically reasonable, but slow? If you treat everything as a uniform sequence, doesn't that bring back the Unicode issues again? (And can you make it work for all of Text/ByteString (kind *), boxed Vectors and lists (* -> *), and unboxed vectors (* -> * with a constraint)? What about operations that change the element type? Surely it's possible with TypeFamilies, ConstraintKinds, and PolyKinds all available, but I'm not sure if it's obvious. Can it go into the Prelude if it uses extensions? Should it also support other containers, like Maps? And so on.)
So my impression is that the reason the problem hasn't been solved yet is that it's hard. We do have some useful things: Functor, Foldable, Traversable, and the classes in Data.Key [1], but for starters none of them can be implemented by Text and ByteString, so that brings us back to square one.
But a constructive idea: what if strict Text and ByteString were both synonyms for unboxed Vectors (already available in ByteString's case [2])? What if, for lazy Text and ByteString, we either had lazy Vectors to make them synonyms of, or a 'data Lazy v' which made a lazy chunked sequence out of any underlying strict Vector-ish type? That would cut down on the number of types, which is a good thing in itself, and it would suggest an obvious way to abstract over them: the existing Functor/Foldable/Traversable/Data.Key classes extended with an associated constraint. I'm not sure how many of the use cases that would cover, but certainly a lot more than we have now. It wouldn't solve every one of the questions above, but it answers many of them, and it seems like a good compromise.
The big drawbacks I can see are that (a) it would be a *lot* of work, especially if we want to be completely uncompromising on performance, and (b) I'm not sure how pinned arrays and interoperation with C would be handled without making it complicated again. (Though I suppose we could just punt and have ByteString be a synonym for Vector.Storable (pinned) and Text for Vector.Unboxed (not pinned) to mirror the current situation. Or maybe we could have a pinArray# primop?)
Anyway, if I'm blue-sky dreaming, that's what looks appealing to me.
[1] http://hackage.haskell.org/packages/archive/keys/3.0.1/doc/html/Data-Key.htm...
[2] http://hackage.haskell.org/package/vector-bytestring
-- Your ship was destroyed in a monadic eruption.
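To make the kind problem concrete, here is one possible sketch of a sequence class that covers both kind-* types (Text, ByteString) and lists, using an associated type for the element. The class `Sequence`, its methods and `omap` are hypothetical names invented here; this is not the design proposed above, nor any existing library's API, and it deliberately ignores performance by routing the generic function through lists.

```haskell
{-# LANGUAGE TypeFamilies #-}

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Word (Word8)

-- Hypothetical class: 's' has kind *, so Text and ByteString fit,
-- and the element type is recovered through the associated type.
class Sequence s where
  type Elem s
  toElems   :: s -> [Elem s]
  fromElems :: [Elem s] -> s
  -- Default goes through a list; instances may override it.
  seqLength :: s -> Int
  seqLength = length . toElems

instance Sequence [a] where
  type Elem [a] = a
  toElems   = id
  fromElems = id

instance Sequence T.Text where
  type Elem T.Text = Char
  toElems   = T.unpack
  fromElems = T.pack
  seqLength = T.length

instance Sequence B.ByteString where
  type Elem B.ByteString = Word8
  toElems   = B.unpack
  fromElems = B.pack
  seqLength = B.length

-- A map that cannot change the element type, written once for all instances.
omap :: Sequence s => (Elem s -> Elem s) -> s -> s
omap f = fromElems . map f . toElems
```

Even this tiny version runs into the questions raised above: `omap` is slow unless it becomes a class method, maps that change the element type are not expressible, and an element-wise view of Text invites the Unicode mistakes discussed earlier in the thread.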

Andrew:
There is a ListLike package, which provides this nice abstraction, but I don't know if it is ready or complete enough for serious use.
I'm thinking of using it for the same reasons.
Does anyone have experience with it to share?
-- Alberto.

At Sat, 10 Nov 2012 15:16:30 +0100, Alberto G. Corona wrote:
There is a ListLike package, which provides this nice abstraction, but I don't know if it is ready or complete enough for serious use. I'm thinking of using it for the same reasons.
Does anyone have experience with it to share?
I've used it in the past and it's solid; it's been around for a while and the original author knows his Haskell. Things I don't like:
* The classes are huge: http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-Lis.... I'd much rather have all those utility functions outside the type class, for no particular reason other than the ugliness of the type class.
* It defines its own wrappers for `ByteString': http://hackage.haskell.org/packages/archive/ListLike/3.1.6/doc/html/Data-Lis....
* It doesn't have instances for `Text'; you have to resort to the `listlike-instances' package.
In any case I think it's on the right track. I'd really like something like that, but much simpler, to be in `base'.
Francesco
participants (7)
- Alberto G. Corona
- Andrew Pennebaker
- Bas van Dijk
- Francesco Mazzoli
- Gábor Lehel
- Johan Tibell
- Roman Cheplyaka