
Hi all,

As we consider what we want to go into the Haskell Platform, I would like to seriously discuss the issue of container polymorphism, especially, but not exclusively, in dealing with strings. We currently have a large number of container types, many of which are going to co-exist long-term, i.e. there is no winner; they are useful in different situations. Here is an incomplete list of what we are dealing with:

Fully polymorphic sequence types: [], Data.Sequence, Data.Edison sequence types, Data.Array.Array & friends
Restricted polymorphic sequence types: uvector, vector, storable-vector
Monomorphic sequence types: ByteString (Word8 and Char8), lazy ByteString (Word8 and Char8), Text
Fully polymorphic associative array types: Data.Map, Edison associative arrays, gmap (?)
Less-polymorphic associative array types: bytestring-trie, list-tries, Data.IntMap

I'm sure there are more, of course, but those are the ones that a cursory search reveals.

The question, then, is how we ought to deal with all of these types in the Haskell Platform. The current popular libraries deal with the multitude of container types in a very ad hoc way. For instance, the Parsec parsing library lets you parse Strings and strict ByteStrings; it also includes a class so that you can parse other things. Attoparsec lets you parse only lazy ByteStrings. I don't know of any library that parses Text yet. The binary library deals with lazy ByteStrings, while binary-strict lets you use strict ByteStrings. Libraries like zlib just pick a type (lazy ByteStrings) and use that, not accounting for other possible types you may want to use (granted, in zlib's case the choice of a single data structure seems fairly reasonable, given potential use cases).

The point here is that the same functionality and practically the same code can be split over several different packages based on what data structure you want to use.
Worse, there is no standard way of dealing with this separation: Parsec uses a type class, while binary uses two separate packages. And interoperability could potentially become a problem: the split library, for instance, only supports lists, so you have to roll your own solutions for splitting ByteStrings or Text. I think that with the introduction of the Haskell Platform, we ought to take the opportunity to clean up some of this chaos so that we have a clean set of standard libraries for people to build on.

The way I see it, we have a few options. We could introduce some type classes to cover the various container operations and ask all Platform packages to use only operations from these classes, where applicable. (The ListLike library would probably be a good place to start.) The issues here are efficiency, massive class dictionaries, and type system questions: we need MPTCs or TFs; also, since not all sequence types are polymorphic, we probably need separate map and rigidMap functions as in ListLike, and even that doesn't work for types like uvector that can take some, but not all, concrete element types.

We could also implement a standard module-naming scheme: for every package that operates on strings, for instance, we could require the list-based code to go in Foo.Bar.String and the Text code to go in Foo.Bar.Text. Whenever we "blessed" a new string datatype (presumably not very often), all packages would have to add a new module. The issue here is code duplication.

We could also implement a package-naming scheme: have foo-text and foo-string for the two different datatypes.

What does everyone think about the need for standardization here and how we ought to do it? I'm sorry that, having less experience with Haskell than many, I can't give as much in the way of concrete proposals, but I think it's important that we hash something out here to get more organization in our Platform.

Thanks for your time in reading this admittedly long message,
Alex
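To make the map/rigidMap distinction concrete, here is a minimal sketch of the idea, assuming only type families; the class and method names are hypothetical, merely in the spirit of ListLike:

```haskell
{-# LANGUAGE TypeFamilies #-}

import qualified Data.ByteString.Char8 as B

-- A minimal sequence class in the spirit of ListLike (sketch only;
-- the names Seq, Elem and rigidMap are illustrative, not a proposal).
class Seq s where
  type Elem s :: *
  -- rigidMap keeps the element type fixed, so even monomorphic
  -- containers like ByteString can support it.
  rigidMap :: (Elem s -> Elem s) -> s -> s

instance Seq [a] where
  type Elem [a] = a
  rigidMap = map

instance Seq B.ByteString where
  type Elem B.ByteString = Char
  rigidMap = B.map
```

Because rigidMap never changes the element type, a monomorphic container such as ByteString can be an instance; a fully polymorphic map of type (a -> b) could not be offered for it, which is exactly why ListLike splits the two.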

Alexander Dunlap wrote:
Hi all,
As we consider what we want to go into the Haskell Platform, I would like to seriously discuss the issue of container polymorphism, ... Thanks for your time in reading this admittedly long message,
Thanks for considering it! I think these are important issues; however, they're difficult ones that have been around for a while, and I don't think we'll be able to solve them quickly just because we're making a Haskell Platform.

In some cases the issues aren't as bad as they could be. If a library will take lists of anything, then you can convert your type to a list. If its semantics mean it can only operate on bytes, then it can take lazy ByteStrings (or strict ones, if it only operates on small things or can't stream data anyway)... Although they may be efficient, ByteStrings work badly or not at all for non-byte data.

Incidentally, at some point in the future, a decision may depend on what compiler extensions we're willing to allow into the Platform (I suspect that a nice polymorphic solution [with possible speed problems] basically requires type functions or equivalent).

-Isaac

Alexander Dunlap wrote:
The way I see it, we have a few options. We could introduce some type classes to cover the various container operations and ask all Platform packages to only use operations from these classes, where applicable.
Depending on what exactly you mean by "applicable", there are worse efficiency problems than large dictionaries. Speaking on behalf of Data.Trie, there are a number of functions that can be offered on tries which give large asymptotic improvements over naive/generic functions on maps. In general the efficiency of tries is comparable with that of maps; the whole point of using tries is to be able to use these special functions, which aren't available for maps. This problem isn't unique to tries; it's the major reason why there are so many container types out there. For instance, the snoc function can be defined for [], but it's not a good idea to use it frequently; snoc on Data.Sequence, however, is perfectly fine to use often.

While it'd be nice to have some type classes to make it easier for client code to be polymorphic in the container type, we should be wary of putting too much faith in them. When there are asymptotic or large-constant differences depending on the container, this needs to be made clear to clients and users. This isn't just polymorphism over the representation type; it's polymorphism over *algorithms*. This is a major source of performance problems and confusion in OOP languages, which do provide such generic interfaces. For a language like Haskell, the computational complexity of functions belongs to the API and is as important as the type signature and pragmatic properties.

One thing that may help, though, is if we could agree on what functions should belong to such an interface and how they should be named. The bytestring-trie package took after Data.Map and Data.IntMap, though there's been much contention over the names and the order of arguments since then. Having a coherent set of guidelines (specialized to sequence and map structures) would probably help more than technical changes.

-- Live well, ~wren
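To make the snoc example concrete, here is how the two standard definitions differ in cost (a sketch; the complexity comments are the documented bounds for (++) and Data.Sequence's (|>)):

```haskell
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

-- O(n): (++) copies the entire spine of the first list, so building a
-- sequence with repeated snocList is O(n^2) overall.
snocList :: [a] -> a -> [a]
snocList xs x = xs ++ [x]

-- O(1): Data.Sequence's finger tree makes appending at either end cheap,
-- so repeated snocSeq stays O(n) overall.
snocSeq :: Seq a -> a -> Seq a
snocSeq = (|>)
```

Both functions have the same type shape and the same observable behaviour, which is precisely wren's point: a shared class method named snoc would hide an asymptotic difference that clients need to know about.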

On Thu, Aug 6, 2009 at 12:15 AM, Alexander Dunlap wrote:
Hi all,
As we consider what we want to go into the Haskell Platform, I would like to seriously discuss the issue of container polymorphism, ... Thanks for your time in reading this admittedly long message, Alex
Both comments are well-taken; thank you very much for the feedback! I'm starting to understand more about the issue and have a more concrete idea to suggest.

Some of the most common libraries for dealing with strings seem to be parsers. Parsing libraries would benefit a lot from becoming polymorphic, as Strings, ByteStrings, Texts, and sequences of other token types all have legitimate reasons to be parsed. Furthermore, parsers seem to be libraries that use a very small number of functions from the string type. I think something along the lines of the following would suffice:

    class Monoid c => Parseable c where
      type El c :: *
      uncons  :: c -> Maybe (El c, c)
      splitAt :: Int -> c -> (c, c)
      cons    :: El c -> c -> c
      span, break :: (El c -> Bool) -> c -> (c, c)

with mempty and mappend used from the Monoid class. I don't think this would have very great efficiency implications, since most of the parsing libraries seem to use those functions anyway. I will try to mock up an example of this and run some tests and benchmarks to see if I can come up with an even more concrete proposal.

Any further input is very much appreciated. Thanks!

Alex
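As a quick sanity check that an interface like this is enough to write real combinators, here is a hypothetical sketch: the class restated with only the parts the example needs, a list instance, and a container-polymorphic digit-run parser. Every name beyond those in the message (toTokens, number) is illustrative:

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Char (isDigit)
import Prelude hiding (span)
import qualified Prelude

-- The Parseable interface from above, restated (uncons and span only)
-- so this sketch compiles on its own.
class Monoid c => Parseable c where
  type El c :: *
  uncons :: c -> Maybe (El c, c)
  span   :: (El c -> Bool) -> c -> (c, c)

instance Parseable [a] where
  type El [a] = a
  uncons []       = Nothing
  uncons (x : xs) = Just (x, xs)
  span            = Prelude.span

-- Flatten any Parseable into a token list via uncons, for inspection.
toTokens :: Parseable c => c -> [El c]
toTokens c = maybe [] (\(x, r) -> x : toTokens r) (uncons c)

-- A container-polymorphic combinator: read a leading run of digits.
number :: (Parseable c, El c ~ Char) => c -> Maybe (Int, c)
number s = case span isDigit s of
  (ds, rest) -> case toTokens ds of
    []  -> Nothing
    str -> Just (read str, rest)
```

The same number function would work unchanged for a Text or Char8 ByteString instance, which is the efficiency question worth benchmarking: whether going through uncons/span costs anything over the native operations.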

Hi Alex,
On Thu, Aug 6, 2009 at 9:15 AM, Alexander Dunlap wrote:
The way I see it, we have a few options. We could introduce some type classes to cover the various container operations and ask all Platform packages to only use operations from these classes, where applicable. (The ListLike library would probably be a good place to start.) The issue here is efficiency, massive class dictionaries, and type system issues (we need MPTCs or TFs; also, since not all sequence types are polymorphic, we probably need to have separate map and rigidMap functions like in ListLike, and even that doesn't work for types like uvector that can take some, but not all, concrete element types).
I don't like the *Like naming much. I think the type class deserves the shorter name and concrete instantiations should use more specialized names. Sequence wouldn't be too bad a name for data types that support indexing; Stream would be a good name for types that don't.

There's an issue with regard to monads, though. What type classes do we use for a Stream backed by e.g. an I/O resource? Example using a concrete element type:

    class ByteStream s where
      uncons :: s -> (Word8, s)
      null   :: s -> Bool

    class ByteStreamM s where
      uncons :: Monad m => s -> m (Word8, s)
      null   :: Monad m => s -> m Bool

Do we need two type classes for each "concept" (e.g. stream, sequence)?
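One possible answer, sketched here as an assumption rather than a settled design: make the monadic class the primitive one (parameterized over the monad) and recover the pure interface by instantiating the monad to Identity. The null check is folded into uncons for totality:

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, FlexibleContexts #-}

import Data.Functor.Identity (Identity (..))
import Data.Word (Word8)

-- A single monadic stream class (hypothetical). An I/O-backed stream
-- would pick m = IO; pure containers work at any monad.
class Monad m => ByteStreamM m s where
  unconsM :: s -> m (Maybe (Word8, s))

instance Monad m => ByteStreamM m [Word8] where
  unconsM []       = return Nothing
  unconsM (x : xs) = return (Just (x, xs))

-- The pure interface is just the Identity instantiation, so no second
-- class is needed for pure streams.
uncons :: ByteStreamM Identity s => s -> Maybe (Word8, s)
uncons = runIdentity . unconsM
```

The cost of this design is the MPTC machinery and an extra type parameter in every signature; whether that is better than two parallel classes is exactly the open question above.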
We could also implement a standard module-naming scheme: for every package that operates on strings, for instance, we could require the list-based code to go in Foo.Bar.String and the Text code to go in Foo.Bar.Text. Whenever we "blessed" a new string datatype (presumably not very often), all packages would have to add a new module. The issue is code duplication.
I would like to propose this as a general recommendation for module naming, as there are already plenty of modules that do it this way. It also seems the most sensible place to put the data type, as it groups related modules in the module hierarchy and thus in the documentation. I don't know about requiring all packages to add a new module every time there's a new type, though.
We could also implement a package naming scheme: have foo-text and foo-string for the two different datatypes.
This seems like a good scheme to me. It's the same <general>-<specific> scheme that's proposed for modules above.
What does everyone think about the need for standardization here and how we ought to do it? I'm sorry that, having less experience with Haskell than many, I can't give as much in the way of concrete proposals, but I think it's important that we hash something out here to get more organization in our Platform.
I definitely think the container problem needs to be tackled. However, I don't know what the solution should look like. Cheers, Johan

Note the Stream classes you mention already have an analogue hiding in parsec 3 -- minus the null check, that is, since there unconsing returns m (Maybe (c, s)) rather than m (c, s), permitting the check and the uncons to be performed in the same operation, but denying access to them separately.
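For readers who haven't seen it, parsec 3's Stream class looks roughly like the following (restated here rather than imported, so treat the exact details as approximate; the list instance mirrors the one parsec ships):

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies, FlexibleInstances #-}

import Data.Functor.Identity (Identity (..))

-- Approximately parsec 3's Stream class (the real one lives in
-- Text.Parsec.Prim). The fundep s -> t plays the role an associated
-- type would: the stream type determines its token type.
class Monad m => Stream s m t | s -> t where
  uncons :: s -> m (Maybe (t, s))

-- Any list is a stream of its elements, in any monad.
instance Monad m => Stream [t] m t where
  uncons []       = return Nothing
  uncons (x : xs) = return (Just (x, xs))
```

Returning m (Maybe (t, s)) is what makes emptiness and unconsing one operation: an empty stream yields Nothing, so no separate null is needed.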
-Edward Kmett
On Fri, Aug 7, 2009 at 4:49 AM, Johan Tibell wrote:
Hi Alex,
I don't like the *Like naming much. ... I definitely think the container problem needs to be tackled. However, I don't know what the solution should look like.
Cheers,
Johan

_______________________________________________
Libraries mailing list
Libraries@haskell.org
http://www.haskell.org/mailman/listinfo/libraries
participants (5)
- Alexander Dunlap
- Edward Kmett
- Isaac Dupree
- Johan Tibell
- wren ng thornton