Can we do better than duplicate APIs? [was: Data.CompactString 0.3]

[sorry for the somewhat longer rant, you may want to skip to the more technical questions at the end of the post] Twan van Laarhoven wrote:
I would like to announce version 0.3 of my Data.CompactString library. Data.CompactString is a wrapper around Data.ByteString that represents a Unicode string. This new version supports different encodings, as can be seen from the data type:
[...]
Homepage: http://twan.home.fmf.nl/compact-string/ Haddock: http://twan.home.fmf.nl/compact-string/doc/html/ Source: darcs get http://twan.home.fmf.nl/repos/compact-string
After taking a look at the Haddock docs, I was impressed by the amount of repetition in the APIs. Not ony does Data.CompactString duplicate the whole Data.ByteString interface (~100 functions, adding some more for encoding and decoding), the whole interface is again repeated another four times, once for each supported encoding. Now, this is /not/ meant as a criticism of the compact-string package in particular. To the contrary, duplicating a fat interface for almost identical functionality is apparently state-of-the-art in Haskell library design, viz. the celebrated Data.Bytesting, whose API is similarly repetitive (see Data.List, Data.ByteString.Lazy, etc...), as well as Map/IntMap/SetIntSet etc. I greatly appreciate the effort that went into these libraries, and admire the elegance of the implementation as well as the stunning results wrt. efficiency gains etc.. However I fear that duplicating interfaces in this way will prove to be problematic in the long run. The problems I (for-)see are for maintenance and usability, both of which are of course two sides of the same coin. For the library implementer, maintenance will become more difficult, as ever more of such 'almost equal' interfaces must be maintained and kept in sync. One could use code generation or macro expansion to alleviate this, but IMO the necessity to use extra-language pre-processors points to a weakness in the language; it be much less complicated and more satisfying to use a language feature that avoids the repetition instead of generating code to facilitate it. On the other side of teh coin, usability suffers as one has to lookup the (almost) same function in more and more different (but 'almost equal') module interfaces, depending on whether the string in question is Char vs. Byte, strict vs. lazy, packed vs. unpacked, encoded in X or Y or Z..., especially since there is no guarantee that the function is /really/ spelled the same everywhere and also really does what the user expects.(*) I am certain that most, if not all, people involved with these new libraries are well aware of these infelicities. Of course, type classes come to mind as a possible solution. However, something seems to prevent developers from using them to capture e.g. a common String or ListLike interface. Whatever this 'something' is, I think it should be discussed and addressed, before the number of 'almost equal' APIs becomes unmanageable for users and maintainers. Here are some raw ideas: One reason why I think type classes have not (yet) been used to reduce the amount of API repetition is that Haskell doesn't (directly) support abstraction over type constraints nor over the number of type parameters (polykinded types?). Often such 'almost equal' module APIs differ in exactly these aspects, i.e. one has an additional type parameter, while yet another one needs slightly different or additional constraints on certain types. Oleg K. has shown that some if these limitations can be overcome w/o changing or adding features to the language, however these tricks are not easy to learn and apply. Another problem is the engineering question of how much to put into the class proper: there is a tension between keeping the class as simple as possible (few methods, many parametric functions) for maximum usability vs. making it large (many methods, less parametric functions) for maximum efficiency via specialized implementations. It is often hard to decide this question up front, i.e. before enough instances are available. (This has been stated as a cause for defering the decision for a common interface to list-like values or strings). Since the type of a function doesn't reveal whether it is a normal function with a class constraint or a real class method, I imagine a language feature that (somehow) enables me to specialize such a function for a particular instance even if it is not a proper class member. Or maybe we have come to the point where Haskell's lack of a 'real' module system, like e.g. in SML, actually starts to hurt? Can associated types come to the rescue? Cheers Ben -- (*) I know that strictly speaking a class doesn't guarantee any semantic conformance either, but at least there is a common place to document the expected laws that all implementations should obey. With duplicated module APIs there is no such single place.

On Friday 23 March 2007 18:55, Benjamin Franksen wrote:
[sorry for the somewhat longer rant, you may want to skip to the more technical questions at the end of the post]
Twan van Laarhoven wrote:
I would like to announce version 0.3 of my Data.CompactString library. Data.CompactString is a wrapper around Data.ByteString that represents a Unicode string. This new version supports different encodings, as can be seen from the data type:
[...]
Homepage: http://twan.home.fmf.nl/compact-string/ Haddock: http://twan.home.fmf.nl/compact-string/doc/html/ Source: darcs get http://twan.home.fmf.nl/repos/compact-string
After taking a look at the Haddock docs, I was impressed by the amount of repetition in the APIs. Not ony does Data.CompactString duplicate the whole Data.ByteString interface (~100 functions, adding some more for encoding and decoding), the whole interface is again repeated another four times, once for each supported encoding.
I'd like to mention that as maintainer of Edison, I face similar difficulties. The data structure interfaces have scores of functions and there are about 20 different concrete implementations of various sorts. Even minor interface changes require a lot of tedious editing to make sure that everything stays in sync. [snip]
The problems I (for-)see are for maintenance and usability, both of which are of course two sides of the same coin. For the library implementer, maintenance will become more difficult, as ever more of such 'almost equal' interfaces must be maintained and kept in sync.
This is true. For the specific case of Edison, this is not too bad because all the implementations are currently in one package and under the control of one person (namely, me). That might, however, change in the future. Obviously, it is a problem _now_ for the ByteString and friends.
One could use code generation or macro expansion to alleviate this, but IMO the necessity to use extra-language pre-processors points to a weakness in the language; it be much less complicated and more satisfying to use a language feature that avoids the repetition instead of generating code to facilitate it.
I've considered something like this for Edison. Actually, I've considered going even further and building the Edison concrete implementations in a theorem prover to prove correctness and then extracting the Haskell source. Some sort of in-langauge or extra-language support for mechanicly producing the source files for the full API from the optimized "core" API would be quite welcome. Handling export lists, haddock comments, typeclass instances, etc, are quite tedious. I have to admit, I'm not sure what an in-language mechanism for doing something like this would look like. Template Haskell is an option, I suppose, but its pretty hard to work with and highly non-portable. It also wouldn't produce Haddock-consumable source files. ML-style first class modules might fit the bill, but I'm not sure anyone is seriously interested in bolting that onto Haskell. [snip]
Here are some raw ideas:
One reason why I think type classes have not (yet) been used to reduce the amount of API repetition is that Haskell doesn't (directly) support abstraction over type constraints nor over the number of type parameters (polykinded types?). Often such 'almost equal' module APIs differ in exactly these aspects, i.e. one has an additional type parameter, while yet another one needs slightly different or additional constraints on certain types. Oleg K. has shown that some if these limitations can be overcome w/o changing or adding features to the language, however these tricks are not easy to learn and apply.
I mostly put these kinds of type system tricks in the same category as TH: hard to use and non-portable. [snip]
Or maybe we have come to the point where Haskell's lack of a 'real' module system, like e.g. in SML, actually starts to hurt? Can associated types come to the rescue?
I'm not familiar enough with associated types to know if they would help with these sorts of problems. Can someone more knowledgable comment?
Cheers Ben
Rob Dockins

Robert Dockins wrote:
After taking a look at the Haddock docs, I was impressed by the amount of repetition in the APIs. Not ony does Data.CompactString duplicate the whole Data.ByteString interface (~100 functions, adding some more for encoding and decoding), the whole interface is again repeated another four times, once for each supported encoding.
I'd like to mention that as maintainer of Edison, I face similar difficulties. The data structure interfaces have scores of functions and there are about 20 different concrete implementations of various sorts. Even minor interface changes require a lot of tedious editing to make sure that everything stays in sync.
One could use code generation or macro expansion to alleviate this, but IMO the necessity to use extra-language pre-processors points to a weakness in the language; it be much less complicated and more satisfying to use a language feature
avoids the repetition instead of generating code to facilitate it.
I've considered something like this for Edison. Actually, I've considered going even further and building the Edison concrete implementations in a theorem prover to prove correctness and then extracting the Haskell
Some sort of in-langauge or extra-language support for mechanicly
But... you have the type of all functions nailed down in classes. Thus, even if a change in the API means a lot of tedious work adapting the concrete implementations, at least the compiler helps you to check that the implementations will conform to the interface (class); and users have to consult only the API docs, and not every single function in all 20 implementations. With ByteString and friends there is (yet) no common interface laid down anywhere. All the commonality is based on custom and good sense and the willingness and ability of the developers to make their interfaces compatible to those of others. that source. producing
the source files for the full API from the optimized "core" API would be quite welcome. Handling export lists,
How so? I thought in Edision the API is a set of type classes. Doesn't that mean export lists can be empty (since instances are exported automatically)?
haddock comments,
I thought all the documentation would be in the API classes, not in the concrete implementations.
typeclass instances, etc, are quite tedious.
I have to admit, I'm not sure what an in-language mechanism for doing something like this would look like. Template Haskell is an option, I suppose, but its pretty hard to work with and highly non-portable. It also wouldn't produce Haddock-consumable source files. ML-style first class modules might fit the bill, but I'm not sure anyone is seriously interested in bolting that onto Haskell.
As I explained to SPJ, I am less concerned with duplicated work when implementing concrete data structures, as with the fact that there is still no (compiler checkable) common interface for e.g. string-like thingies, apart from convention to use similar names for similar features. Cheers Ben

On Mar 28, 2007, at 2:44 PM, Benjamin Franksen wrote:
Robert Dockins wrote:
After taking a look at the Haddock docs, I was impressed by the amount of repetition in the APIs. Not ony does Data.CompactString duplicate the whole Data.ByteString interface (~100 functions, adding some more for encoding and decoding), the whole interface is again repeated another four times, once for each supported encoding.
I'd like to mention that as maintainer of Edison, I face similar difficulties. The data structure interfaces have scores of functions and there are about 20 different concrete implementations of various sorts. Even minor interface changes require a lot of tedious editing to make sure that everything stays in sync.
But... you have the type of all functions nailed down in classes. Thus, even if a change in the API means a lot of tedious work adapting the concrete implementations, at least the compiler helps you to check that the implementations will conform to the interface (class);
This is true.
and users have to consult only the API docs, and not every single function in all 20 implementations. With ByteString and friends there is (yet) no common interface laid down anywhere. All the commonality is based on custom and good sense and the willingness and ability of the developers to make their interfaces compatible to those of others.
One could use code generation or macro expansion to alleviate this, but IMO the necessity to use extra-language pre-processors points to a weakness in the language; it be much less complicated and more satisfying to use a language feature that avoids the repetition instead of generating code to facilitate it.
I've considered something like this for Edison. Actually, I've considered going even further and building the Edison concrete implementations in a theorem prover to prove correctness and then extracting the Haskell source. Some sort of in-langauge or extra-language support for mechanicly producing the source files for the full API from the optimized "core" API would be quite welcome. Handling export lists,
How so? I thought in Edision the API is a set of type classes. Doesn't that mean export lists can be empty (since instances are exported automatically)?
No. Edison allows you to directly import the module and bypass the typeclass APIs if you wish. Also, some implementations have special functions that are not part of the general API, and are only available via the module exports. One could make typeclasses the only way to access the main API, but I rather suspect there would be performance implications. I get the impression that typeclass specialization is less advanced than intermodule inlining (could be wrong though).
haddock comments,
I thought all the documentation would be in the API classes, not in the concrete implementations.
It is now, but I've gotten complaints about that (which are at least semi-justified, I feel). Also, the various implementations have different time bounds which must documented in the individual modules. Ideally, I'd like to have the function documentation string and the time bounds on each function in each concrete implementation. I've not done this because its just too painful to maintain manually.
typeclass instances, etc, are quite tedious.
I have to admit, I'm not sure what an in-language mechanism for doing something like this would look like. Template Haskell is an option, I suppose, but its pretty hard to work with and highly non- portable. It also wouldn't produce Haddock-consumable source files. ML-style first class modules might fit the bill, but I'm not sure anyone is seriously interested in bolting that onto Haskell.
As I explained to SPJ, I am less concerned with duplicated work when implementing concrete data structures, as with the fact that there is still no (compiler checkable) common interface for e.g. string-like thingies, apart from convention to use similar names for similar features.
Fair enough. I guess my point is that typeclasses (ad per Edison) are only a partial solution to this problem, even if you can stretch them sufficiently (with eg, MPTC+fundeps+whatever other extension) to make them cover all your concrete implementations.
Cheers Ben
Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

Robert Dockins wrote:
Some sort of in-langauge or extra-language support for mechanicly producing the source files for the full API from the optimized "core" API would be quite welcome.
Have you considered using DrIFT? IIRC it is more portable and easier to use than TH.
Handling export lists,
How so? I thought in Edision the API is a set of type classes. Doesn't that mean export lists can be empty (since instances are exported automatically)?
No. Edison allows you to directly import the module and bypass the typeclass APIs if you wish.
Ah, I didn't know that.
Also, some implementations have special functions that are not part of the general API, and are only available via the module exports.
Ok.
One could make typeclasses the only way to access the main API, but I rather suspect there would be performance implications. I get the impression that typeclass specialization is less advanced than intermodule inlining (could be wrong though).
No idea. Experts?
haddock comments,
I thought all the documentation would be in the API classes, not in the concrete implementations.
It is now, but I've gotten complaints about that (which are at least semi-justified, I feel). Also, the various implementations have different time bounds which must documented in the individual modules.
Yes, I forgot about that. Hmmm.
Ideally, I'd like to have the function documentation string and the time bounds on each function in each concrete implementation. I've not done this because its just too painful to maintain manually.
I can relate to that. The more so since establishing such time bounds with confidence is not trivial even if the code looks simple. BTW, code generation (of whatever sort) wouldn't help with that, right? I wonder: would it be worthwhile to split the package into smaller parts that could be upgraded in a somewhat less synchronous way? (so that the maintenance effort can be spread over a longer period)
I have to admit, I'm not sure what an in-language mechanism for doing something like this would look like. Template Haskell is an option, I suppose, but its pretty hard to work with and highly non- portable. It also wouldn't produce Haddock-consumable source files. ML-style first class modules might fit the bill, but I'm not sure anyone is seriously interested in bolting that onto Haskell.
As I explained to SPJ, I am less concerned with duplicated work when implementing concrete data structures, as with the fact that there is still no (compiler checkable) common interface for e.g. string-like thingies, apart from convention to use similar names for similar features.
Fair enough. I guess my point is that typeclasses (ad per Edison) are only a partial solution to this problem, even if you can stretch them sufficiently (with eg, MPTC+fundeps+whatever other extension) to make them cover all your concrete implementations.
Yes, and I think these problems would be worth some more research effort. Besides, I dearly hope that we can soon experiment with associated type synonyms... Cheers Ben

On Wednesday 28 March 2007 17:08, Benjamin Franksen wrote:
Robert Dockins wrote:
Some sort of in-langauge or extra-language support for mechanicly
producing
the source files for the full API from the optimized "core" API would be quite welcome.
Have you considered using DrIFT? IIRC it is more portable and easier to use than TH.
DrIFT only works on datatype declarations (AFAIK) and doesn't really cover the use cases in question. [snip]
haddock comments,
I thought all the documentation would be in the API classes, not in the concrete implementations.
It is now, but I've gotten complaints about that (which are at least semi-justified, I feel). Also, the various implementations have different time bounds which must documented in the individual modules.
Yes, I forgot about that. Hmmm.
Ideally, I'd like to have the function documentation string and the time bounds on each function in each concrete implementation. I've not done this because its just too painful to maintain manually.
I can relate to that. The more so since establishing such time bounds with confidence is not trivial even if the code looks simple. BTW, code generation (of whatever sort) wouldn't help with that, right?
Well, I can't imagine any tool that would prove the bounds for me unless automatic proof techniques have improved a _lot_ in the last week or so ;-) However, if I could record the bounds once somewhere for each implementation and then have them auto merged with the documentation for each function, that would be great.
I wonder: would it be worthwhile to split the package into smaller parts that could be upgraded in a somewhat less synchronous way? (so that the maintenance effort can be spread over a longer period)
Perhaps, but that only amortizes the effort rather than reducing it. [snip]
As I explained to SPJ, I am less concerned with duplicated work when implementing concrete data structures, as with the fact that there is still no (compiler checkable) common interface for e.g. string-like thingies, apart from convention to use similar names for similar features.
Fair enough. I guess my point is that typeclasses (ad per Edison) are only a partial solution to this problem, even if you can stretch them sufficiently (with eg, MPTC+fundeps+whatever other extension) to make them cover all your concrete implementations.
Yes, and I think these problems would be worth some more research effort.
Agreed.
Besides, I dearly hope that we can soon experiment with associated type synonyms...
Cheers Ben
Rob Dockins

On Wed, 2007-03-28 at 20:44 +0200, Benjamin Franksen wrote:
But... you have the type of all functions nailed down in classes. Thus, even if a change in the API means a lot of tedious work adapting the concrete implementations, at least the compiler helps you to check that the implementations will conform to the interface (class); and users have to consult only the API docs, and not every single function in all 20 implementations. With ByteString and friends there is (yet) no common interface laid down anywhere. All the commonality is based on custom and good sense and the willingness and ability of the developers to make their interfaces compatible to those of others.
Remember that there's more to an API than a bunch of types. The type class only ensures common types. You must still rely on the good sense and ability of the developers to ensure other properties like strictness, time complexity and simply what the functions should do. Duncan

[Probably libraries@haskell.org is the right list for this message, so I'm fwding your message below, and will reply there.] | -----Original Message----- | From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Benjamin | Franksen | Sent: 23 March 2007 22:56 | To: haskell-cafe@haskell.org | Cc: haskell@haskell.org | Subject: [Haskell-cafe] Can we do better than duplicate APIs? [was: Data.CompactString 0.3] | | [sorry for the somewhat longer rant, you may want to skip to the more | technical questions at the end of the post] | | Twan van Laarhoven wrote: | > I would like to announce version 0.3 of my Data.CompactString library. | > Data.CompactString is a wrapper around Data.ByteString that represents a | > Unicode string. This new version supports different encodings, as can be | > seen from the data type: | > | > [...] | > | > Homepage: http://twan.home.fmf.nl/compact-string/ | > Haddock: http://twan.home.fmf.nl/compact-string/doc/html/ | > Source: darcs get http://twan.home.fmf.nl/repos/compact-string | | After taking a look at the Haddock docs, I was impressed by the amount of | repetition in the APIs. Not ony does Data.CompactString duplicate the whole | Data.ByteString interface (~100 functions, adding some more for encoding | and decoding), the whole interface is again repeated another four times, | once for each supported encoding. | | Now, this is /not/ meant as a criticism of the compact-string package in | particular. To the contrary, duplicating a fat interface for almost | identical functionality is apparently state-of-the-art in Haskell library | design, viz. the celebrated Data.Bytesting, whose API is similarly | repetitive (see Data.List, Data.ByteString.Lazy, etc...), as well as | Map/IntMap/SetIntSet etc. I greatly appreciate the effort that went into | these libraries, and admire the elegance of the implementation as well as | the stunning results wrt. efficiency gains etc.. However I fear that | duplicating interfaces in this way will prove to be problematic in the long | run. | | The problems I (for-)see are for maintenance and usability, both of which | are of course two sides of the same coin. For the library implementer, | maintenance will become more difficult, as ever more of such 'almost equal' | interfaces must be maintained and kept in sync. One could use code | generation or macro expansion to alleviate this, but IMO the necessity to | use extra-language pre-processors points to a weakness in the language; it | be much less complicated and more satisfying to use a language feature that | avoids the repetition instead of generating code to facilitate it. On the | other side of teh coin, usability suffers as one has to lookup the (almost) | same function in more and more different (but 'almost equal') module | interfaces, depending on whether the string in question is Char vs. Byte, | strict vs. lazy, packed vs. unpacked, encoded in X or Y or Z..., especially | since there is no guarantee that the function is /really/ spelled the same | everywhere and also really does what the user expects.(*) | | I am certain that most, if not all, people involved with these new libraries | are well aware of these infelicities. Of course, type classes come to mind | as a possible solution. However, something seems to prevent developers from | using them to capture e.g. a common String or ListLike interface. Whatever | this 'something' is, I think it should be discussed and addressed, before | the number of 'almost equal' APIs becomes unmanageable for users and | maintainers. | | Here are some raw ideas: | | One reason why I think type classes have not (yet) been used to reduce the | amount of API repetition is that Haskell doesn't (directly) support | abstraction over type constraints nor over the number of type parameters | (polykinded types?). Often such 'almost equal' module APIs differ in | exactly these aspects, i.e. one has an additional type parameter, while yet | another one needs slightly different or additional constraints on certain | types. Oleg K. has shown that some if these limitations can be overcome w/o | changing or adding features to the language, however these tricks are not | easy to learn and apply. | | Another problem is the engineering question of how much to put into the | class proper: there is a tension between keeping the class as simple as | possible (few methods, many parametric functions) for maximum usability vs. | making it large (many methods, less parametric functions) for maximum | efficiency via specialized implementations. It is often hard to decide this | question up front, i.e. before enough instances are available. (This has | been stated as a cause for defering the decision for a common interface to | list-like values or strings). Since the type of a function doesn't reveal | whether it is a normal function with a class constraint or a real class | method, I imagine a language feature that (somehow) enables me to | specialize such a function for a particular instance even if it is not a | proper class member. | | Or maybe we have come to the point where Haskell's lack of a 'real' module | system, like e.g. in SML, actually starts to hurt? Can associated types | come to the rescue? | | Cheers | Ben | -- | (*) I know that strictly speaking a class doesn't guarantee any semantic | conformance either, but at least there is a common place to document the | expected laws that all implementations should obey. With duplicated module | APIs there is no such single place. | | _______________________________________________ | Haskell-Cafe mailing list | Haskell-Cafe@haskell.org | http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (4)
-
Benjamin Franksen
-
Duncan Coutts
-
Robert Dockins
-
Simon Peyton-Jones