could we get a Data instance for Data.Text.Text?

Hello,

Would it be possible to get a Data instance for Data.Text.Text? This would allow us to create a Serialize instance of Text for use with happstack -- which would be extremely useful.

We (at seereason) are currently using this patch:

http://src.seereason.com/haskell-text-debian/debian/patches/add_Data_instanc...

which basically adds:

+textType = mkStringType "Data.Text"
+
+instance Data Text where
+    toConstr x = mkStringConstr textType (unpack x)
+    gunfold _k z c = case constrRep c of
+                       (CharConstr x) -> z (pack [x])
+                       _ -> error "gunfold for Data.Text"
+    dataTypeOf _ = textType
+

This particular implementation avoids exposing the internals of the Data.Text type by casting it to a String in toConstr and gunfold. That is similar to how Data is implemented for some numeric types. However, the space overhead of casting a Float to a Double is far less than that of casting a Text to a String, so maybe that is not a good idea?

Alternatively, Data.ByteString just does 'deriving Data'. However, bytestring also exports Data.ByteString.Internal, whereas Data.Text.Internal is not exported.

Any thoughts? I would like to get this handled upstream so that all happstack users can benefit from it.

- jeremy

On Fri, Jan 22, 2010 at 2:24 PM, Jeremy Shaw wrote:
> Would it be possible to get a Data instance for Data.Text.Text?

From the last time this came up, I gather that the correctish thing to do (for reasons too obscure to me) is to teach SYB and its many cousins about Text, or else there'll be some sort of disturbance in the Force.
If that feels too arduous, I'd consider adding your suggested instance of Data until such time as the One True Generics Package emerges to walk the earth. But please give it a think first.

> > Would it be possible to get a Data instance for Data.Text.Text?
> From the last time this came up, I gather that the correctish thing to do (for reasons too obscure to me) is to teach SYB and its many cousins about Text, or else there'll be some sort of disturbance in the Force.

No, that's definitely not correct, or even remotely scalable as we increase the number of abstract types in disparate packages. If someone suggests it's necessary for their generics library, I suggest you use Uniplate ;-)

There are two options, both listed in the above email.

1) Use string conversion in the instance. This is morally correct, and works perfectly. However, as mentioned, it does not perform well. The Map/Set instances both do a similar trick.

2) Just add 'deriving Data' to the type, and hope no one abuses the internals. This is what ByteString does. It works great and it's fast, but you are violating some amount of abstraction. You have to trust people not to break that abstraction, but it's not a simple abstraction to break - it's the moral equivalent of pointer prodding in a std::string; no one breaks it accidentally.

> If that feels too arduous, I'd consider adding your suggested instance of Data until such time as the One True Generics Package emerges to walk the earth. But please give it a think first.

Data.Data is the one true runtime reflection package, so Data instances are strongly advised, totally ignoring Generics stuff. I would pick option 2, but a Data instance really is useful.

Thanks, Neil
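
Option 2 can be illustrated with a small sketch. The type Celsius and the helpers constructorName and fieldTypes are hypothetical names made up for this note; they are not part of text or bytestring. The sketch only shows that a derived Data instance lets any client observe the constructor and its field types, even if the constructor itself were not exported.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, gmapQ, showConstr, toConstr, typeOf)

-- Hypothetical abstract type: imagine MkCelsius is not exported.
newtype Celsius = MkCelsius Double
  deriving (Show, Data)

-- Name of the outermost constructor, recovered through reflection.
constructorName :: Data a => a -> String
constructorName = showConstr . toConstr

-- Types of the immediate fields, again via the derived instance.
fieldTypes :: Data a => a -> [String]
fieldTypes = gmapQ (\field -> show (typeOf field))
```

Applied to MkCelsius 21.5, constructorName yields the constructor name and fieldTypes lists the single Double field, which is the kind of abstraction leak that option 2 accepts.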

On Sat, Jan 23, 2010 at 7:57 AM, Neil Mitchell wrote:
> No, that's definitely not correct, or even remotely scalable as we increase the number of abstract types in disparate packages.

Yes.. happstack is facing another aspect of this scalability issue as well. We have a class, Serialize, which is used to serialize and deserialize data. It builds on the binary library, but adds the ability to version your data types and migrate data from older versions to newer versions.

This has a serious scalability issue though, because it requires that each type a user might want to serialize has a Serialize instance. So do we:

1. provide Serialize instances for as many data types from libraries on hackage as we can, resulting in depending on a large number of packages that people are required to install, even though they will only use a small fraction of them.

2. convince people that Serialize deserves the same status as Data, and then convince authors to create Serialize instances for their type? It would be nice, but authors will start complaining if they are asked to provide a zillion other instances for their types as well. And they will be annoyed if their library has to depend on a bunch of other libraries, just so they can provide some instances that only a small fraction of their users might use. So, this method does not scale as the number of 'interesting' classes grows.

3. let individual users define the Serialize instances as they need them. Unfortunately, if two different library authors defined a Serialize instance for Text in their libraries, you could not use both libraries in your application because of the conflicting Serialize instances. So this method does not scale when the number of libraries using the Serialize class grows.

Not really sure what the workaround is. #1 could work if there was some way to just selectively install the pieces as you need them. But the only way to do this now would be to create a lot of cabal packages which just defined a single instance -- happstack-text, happstack-map, happstack-time, happstack-etc. One for each package that has types we want to create a serialization instance for...

Any other suggestions?

- jeremy

On Sat, 23 Jan 2010 16:57:49 -0600, Jeremy Shaw wrote:
> On Sat, Jan 23, 2010 at 7:57 AM, Neil Mitchell wrote:
> > No, that's definitely not correct, or even remotely scalable as we increase the number of abstract types in disparate packages.
> Yes.. happstack is facing another aspect of this scalability issue as well. We have a class, Serialize, which is used to serialize and deserialize data. It builds on the binary library, but adds the ability to version your data types and migrate data from older versions to newer versions.
> This has a serious scalability issue though, because it requires that each type a user might want to serialize has a Serialize instance.
> So do we:
> [..]
> Any other suggestions?

4. Write a new package:
   * serialize-text
   * text-instances (which would be a place holder for more instances)

I would go for trying solution 2. and otherwise solution 4.

-- 
Nicolas Pouillard
http://nicolaspouillard.fr

The only safe rule is: if you don't control the class, C, or you don't control the type constructor, T, don't make instance C T.

Application writers can often relax that rule as the set of dependencies for the whole application is known and in many cases any reasonable instance for a class C and constructor T is acceptable. Under those conditions, the worst-case scenario is that the application writer may need to remove an instance declaration when migrating to new versions of the dependencies.

When you control a class C, you should make as many (relevant) type constructors instances of it as is reasonably possible, i.e. without adding any extensive dependencies. So at the very least, all standard type constructors. Similarly for those who control a type constructor T. This is for convenience. These correspond to solutions #1 and #2, only significantly weakened. Definitely, making a package depend on tons of other packages just to add instances is NOT the correct solution.

The library writers depending on one package for a class and another package for a type are the problem case. There are three potential solutions in this case, which basically reduce the problem to one of the above three cases: either introduce a new type and add it to the class, introduce a new class and add the types to it, or try to push the resolution of such things onto the application writer. The first two options have the benefit that they also protect you from the upstream libraries introducing instances that won't work for you. These two options have the drawback that they are usually less convenient to use. The last option has the benefit that it usually corresponds to having a more flexible/generic library; in some cases you can even go so far as to remove your dependence on the libraries altogether.

One solution to this problem, though it usually can't be done post hoc, is to simply not use the class mechanism except as a convenience. This has the benefit that it usually leads to more flexibility and it helps to realize the third option above. Using Monoid as an example, one can provide functions of the form:

    f :: m -> (m -> m -> m) -> ...

and then also provide

    f' = f mempty mappend :: Monoid m => ...

The parameters can be collected into a record as well. You could even systematize this into:

    class C a where
        getCDict :: CDict a

and then write

    f :: CDict a -> ...
    f' = f getCDict :: C a => ...

Whatever one does, do NOT add instances of type constructors you don't control to classes you don't control. This can lead to cases where two libraries can't be used together at all.
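
The dictionary-passing idea described in this message can be sketched in a few lines. All names here (combineWith, combine, MonoidDict, combineDict, sumDict) are invented for illustration and come from no library; this is one possible shape under those assumptions, not a prescribed design.

```haskell
-- Explicit-dictionary version: no type class is required of the caller.
combineWith :: m -> (m -> m -> m) -> [m] -> m
combineWith unit op = foldr op unit

-- Convenience wrapper for types that do have a Monoid instance.
combine :: Monoid m => [m] -> m
combine = combineWith mempty mappend

-- The same parameters collected into a record.
data MonoidDict m = MonoidDict
  { dictUnit :: m
  , dictOp   :: m -> m -> m
  }

combineDict :: MonoidDict m -> [m] -> m
combineDict d = combineWith (dictUnit d) (dictOp d)

-- A dictionary chosen by the application writer.
-- Int has no canonical Monoid instance, so no orphan is needed here.
sumDict :: MonoidDict Int
sumDict = MonoidDict 0 (+)
```

The library only ever needs combineWith or combineDict; the class-based combine is a thin convenience layer, so an application can supply its own dictionary without defining any instance.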

> The only safe rule is: if you don't control the class, C, or you don't control the type constructor, T, don't make instance C T.

I agree in principle, but in the real world you can't live by this rule. For example, I want to use Uniplate to traverse the tree built by haskell-src-exts. Using Data.Data is too slow, so I need to make my own instances. HSE provides something like 50 types that need instances, and it has to be exactly those types. Also, Uniplate requires instances of a particular class it has. I don't own either of these packages.

Including the HSE instances in Uniplate would just be plain idiotic. Including the Uniplate instances with HSE would make some sense, but would make HSE artificially depend on Uniplate for those who don't want the instances.

So, what's left is to make orphan instances (that I own). It's not ideal, but I don't see any alternative to it.

-- Lennart

Hi,

The problem with Data for Text isn't that we have to write a new instance, but that you could argue that proper handling of Text with Data would not be using a type class, but have special knowledge baked in to Data. That's far worse than the Serialise problem mentioned above, and no one other than the Data authors could solve it. Of course, I don't believe that, but it is a possible interpretation.

The Serialise problem is a serious one. I can't think of any good solutions, but I recommend you give knowledge of your serialise class to Derive (http://community.haskell.org/~ndm/derive/) and then at least the instances can be auto-generated. Writing lots of boilerplate and regularly ripping it up is annoying; setting up something to generate it for you reduces the pain.

> > The only safe rule is: if you don't control the class, C, or you don't control the type constructor, T, don't make instance C T.
> I agree in principle, but in the real world you can't live by this rule. Example, I want to use Uniplate to traverse the tree built by haskell-src-exts, Using Data.Data is too slow, so I need to make my own instances. HSE provides like 50 types that need instances, and it has to be exactly those types. Also, Uniplate requires instances of a particular class it has.

Read my recent blog post (http://neilmitchell.blogspot.com/2010/01/optimising-hlint.html): I optimised Uniplate for working with HSE on top of the Data instances - it's now significantly faster in some cases, which may mean you don't need to resort to the Direct stuff. Of course, if you do, then generating them with Derive is the way to go.

Thanks, Neil

On Sun, Jan 24, 2010 at 5:49 AM, Neil Mitchell wrote:
> Hi,
> The problem with Data for Text isn't that we have to write a new instance, but that you could argue that proper handling of Text with Data would not be using a type class, but have special knowledge baked in to Data. That's far worse than the Serialise problem mentioned above, and no one other than the Data authors could solve it. Of course, I don't believe that, but it is a possible interpretation.

Right.. that is the problem with Text. Do you think the correct thing to do for gunfold and toConstr is to convert the Text to a String and then call the gunfold and toConstr for String? Or something else?

> The Serialise problem is a serious one. I can't think of any good solutions, but I recommend you give knowledge of your serialise class to Derive (http://community.haskell.org/~ndm/derive/) and then at least the instances can be auto-generated. Writing lots of boilerplate and regularly ripping it up is annoying, setting up something to generate it for you reduces the pain.

We currently use template haskell to generate the Serialize instances in most cases (though some data types have more optimized encodings that were written by hand). However, you must supply the Version and Migration instances by hand (they are superclasses of Serialize).

I am all for splitting the Serialize stuff out of happstack .. it is not really happstack specific. Though I suspect pulling it out is not entirely trivial either. I think the existing code depends on syb-with-class.

- jeremy

Hi Jeremy,

As Neil Mitchell said before, if you really don't want to expose the internals of Text (by just using a derived instance) then you have no other alternative than to use String conversion. If you've been using it already and performance is not a big problem, then I guess it's ok.

Regarding the Serialize issue, maybe I am not understanding the problem correctly: isn't that just another generic function? There are generic implementations of binary get and put for at least two generic programming libraries in Hackage [1, 2], and writing one for SYB shouldn't be hard either, I think. Then you could have a trivial way of generating instances of Serialize, namely something like

    instance Serialize MyType where
        getCopy = gget
        putCopy = gput

and you could provide Template Haskell code for generating these. Or even just do

    instance (Data a) => Serialize a where ...

if you are willing to use OverlappingInstances and UndecidableInstances...

Cheers,
Pedro

[1] http://hackage.haskell.org/packages/archive/regular-extras/0.1.2/doc/html/Ge...
[2] http://hackage.haskell.org/packages/archive/multirec-binary/0.0.1/doc/html/G...
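
To make "just another generic function" concrete, here is a toy sketch written directly against Data.Data. It is not the gget/gput of the libraries cited above, and Shape and encodeTokens are hypothetical names. It only flattens a value into constructor names, where a real serializer would emit bytes and would also need a matching decoder.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, gmapQ, showConstr, toConstr)

-- Hypothetical user type with a derived Data instance.
data Shape = Circle Int | Rect Int Int
  deriving (Show, Data)

-- Flatten a value into constructor names in pre-order:
-- the node's own constructor first, then each field in turn.
encodeTokens :: Data a => a -> [String]
encodeTokens x = showConstr (toConstr x) : concat (gmapQ encodeTokens x)
```

Because encodeTokens is defined once for every type with a Data instance, no per-type boilerplate is needed; the cost is that it depends on toConstr being implemented for every type it visits.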

2010/1/26 José Pedro Magalhães
> Hi Jeremy,
> As Neil Mitchell said before, if you really don't want to expose the internals of Text (by just using a derived instance) then you have no other alternative than to use String conversion. If you've been using it already and performance is not a big problem, then I guess it's ok.
> Regarding the Serialize issue, maybe I am not understanding the problem correctly: isn't that just another generic function? There are generic implementations of binary get and put for at least two generic programming libraries in Hackage [1, 2], and writing one for SYB shouldn't be hard either, I think. Then you could have a trivial way of generating instances of Serialize, namely something like
>     instance Serialize MyType where
>         getCopy = gget
>         putCopy = gput

But in what package does 'instance Serialize Text' live? text? happstack-data? a new package, serialize-text? That is the question at hand. Each of those choices has rather annoying complications.

As for using generics, serialization cannot be 100% generic, because we also support migration when the type changes. For example, right now ClockTime is defined:

    data ClockTime = TOD Integer Integer

Let's say that it is later changed to:

    data ClockTime = TOD Bool Integer Integer

Attempting to read the old data you saved would now fail, because the saved data does not have the 'Bool' value. However, perhaps the old data can be migrated by simply setting the Bool to True or False by default.

In happstack we would have:

    $(deriveSerialize ''Old.ClockTime)
    instance Version Old.ClockTime

    $(deriveSerialize ''ClockTime)
    instance Version ClockTime where
        mode = extension 1 (Proxy :: Proxy Old.ClockTime)

    instance Migrate Old.ClockTime ClockTime where
        migrate (Old.TOD i j) = TOD False i j

The Version class is a superclass of the Serialize class, which is required so that when the deserializer is trying to deserialize ClockTime, and runs across an older version of the data type, it knows how to find the older deserialization function that works with that version of the type, and where to find the migrate function to bring it up to the latest version.

- jeremy
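
The migration idea can be shown in isolation with a self-contained toy. This is not happstack's API: the Migrate class below is a local stand-in, and ClockTimeV0, Stored and load are hypothetical names chosen for the sketch. It assumes, as in the message above, that old values are migrated by defaulting the new Bool field to False.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

-- Old on-disk shape of the type (stands in for Old.ClockTime).
data ClockTimeV0 = TODv0 Integer Integer
  deriving (Eq, Show)

-- Current shape, with the extra Bool field.
data ClockTime = TOD Bool Integer Integer
  deriving (Eq, Show)

-- Toy migration class; a local stand-in, not happstack's Migrate.
class Migrate old new where
  migrate :: old -> new

instance Migrate ClockTimeV0 ClockTime where
  migrate (TODv0 i j) = TOD False i j

-- A stored value is tagged with the version it was written under.
data Stored = StoredV0 ClockTimeV0 | StoredV1 ClockTime

-- Reading always yields the newest version, migrating if needed.
load :: Stored -> ClockTime
load (StoredV0 old) = migrate old
load (StoredV1 new) = new
```

The version tag plays the role that the Version superclass plays in the description above: it tells the reader which decoder applies and whether a migration step must follow.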

Hi

> > The problem with Data for Text isn't that we have to write a new instance, but that you could argue that proper handling of Text with Data would not be using a type class, but have special knowledge baked in to Data. That's far worse than the Serialise problem mentioned above, and no one other than the Data authors could solve it. Of course, I don't believe that, but it is a possible interpretation.
> Right.. that is the problem with Text. Do you think the correct thing to do for gunfold and toConstr is to convert the Text to a String and then call the gunfold and toConstr for String? Or something else?

No idea sadly - the SYB stuff was never designed to work with abstract structures, or structures containing strict/unboxed components. Converting the Text to a String should work, so in the absence of any better suggestions, that seems reasonable.

> > The Serialise problem is a serious one. I can't think of any good solutions, but I recommend you give knowledge of your serialise class to Derive (http://community.haskell.org/~ndm/derive/) and then at least the instances can be auto-generated. Writing lots of boilerplate and regularly ripping it up is annoying, setting up something to generate it for you reduces the pain.
> We currently use template haskell to generate the Serialize instances in most cases (though some data types have more optimized encodings that were written by hand). However, you must supply the Version and Migration instances by hand (they are super classes of Serialize). I am all for splitting the Serialize stuff out of happstack .. it is not really happstack specific. Though I suspect pulling it out is not entirely trivial either. I think the existing code depends on syb-with-class.

If you switch to Derive then you can generate the classes with Template Haskell, or run the Derive tool as a preprocessor. Derive abstracts over these details, and also tends to be much easier than working within Template Haskell (which I always find surprisingly difficult).

Thanks, Neil

Jeremy Shaw wrote:
> Hello,
> Would it be possible to get a Data instance for Data.Text.Text? This would allow us to create a Serialize instance of Text for use with happstack -- which would be extremely useful.

Last time this came up, I had a look at providing a Data instance for Text, and I "got as far as needing a Data instance for ByteString#, accompanied by an error I don't fully understand, but I think is telling me that things involving magic hashes are magic:

    Data/Text/Array.hs:104:35:
        Couldn't match kind `#' against `*'
        When matching the kinds of `ByteArray# :: #' and `d :: *'
          Expected type: d
          Inferred type: ByteArray#
        In the first argument of `z', namely `Array'
"

The problem with a Data instance for Text is that it is using this ByteArray# type, which can't easily interact with the Data type-class because it's a special type. I would suggest providing a Data instance for ByteArray#, but I don't think that's possible either. As far as I can understand it all, your Data instance is probably the closest you are going to get to having a decent Data instance without something else (GHC/SYB) changing significantly.

Thanks, Neil.

Hello,
Attached is my new and improved patch to add a Data instance to Data.Text.
The patch just adds:
+-- This instance preserves data abstraction at the cost of inefficiency.
+-- We omit reflection services for the sake of data abstraction.
+
+instance Data Text where
+    gfoldl f z txt = z pack `f` (unpack txt)
+    toConstr _ = error "toConstr"
+    gunfold _ _ = error "gunfold"
+    dataTypeOf _ = mkNoRepType "Data.Text.Text"
Which is based on what the Data instances for Set and Map do:
http://www.haskell.org/ghc/docs/latest/html/libraries/containers-0.3.0.0/src...
http://www.haskell.org/ghc/docs/latest/html/libraries/containers-0.3.0.0/src...
Yay for cargo culting!
It seems like this is better than nothing, possibly the correct answer, and
if someone does decide to add better instances for toConstr and gunfold in
the future, nothing should break? For happstack-data, I think we only need
dataTypeOf.
The instance I posted before definitely did not have valid toConstr /
gunfold instances, so I think we would have noticed if we were actually
trying to use them..
- jeremy
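
For readers who want to try the shape of this instance without patching text, here is a base-only sketch. Label, packLabel, unpackLabel and upperChildren are hypothetical stand-ins: Label plays the role of Text and its two conversion functions play the role of pack and unpack. The instance body mirrors the patch above, and upperChildren shows that a generic traversal still reaches the characters through the String view even though toConstr and gunfold are left undefined.

```haskell
import Data.Char (toUpper)
import Data.Data (Data (..), cast, mkNoRepType)
import Data.Maybe (fromMaybe)

-- Stand-in for an abstract type such as Text; imagine the constructor is hidden.
newtype Label = Label String
  deriving (Eq, Show)

packLabel :: String -> Label
packLabel = Label

unpackLabel :: Label -> String
unpackLabel (Label s) = s

-- Same shape as the patch: traversal goes through the String view,
-- and the reflection methods are deliberately left unimplemented.
instance Data Label where
  gfoldl f z l = z packLabel `f` unpackLabel l
  toConstr _   = error "Label: toConstr"
  gunfold _ _  = error "Label: gunfold"
  dataTypeOf _ = mkNoRepType "Label"

-- Upper-case any immediate child that is a String; leave other children alone.
upperChildren :: Data a => a -> a
upperChildren = gmapT step
  where
    step :: Data d => d -> d
    step d = case cast d :: Maybe String of
      Just s  -> fromMaybe d (cast (map toUpper s))
      Nothing -> d
```

gmapT is defined in terms of gfoldl by default, so it works here; anything that relies on toConstr or gunfold will hit the error calls instead.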

On Tue, Jan 26, 2010 at 11:52:34AM -0600, Jeremy Shaw wrote:
> +    toConstr _ = error "toConstr"
> +    gunfold _ _ = error "gunfold"

Isn't it better to write

    error "Data.Text.Text: toConstr"

Usually I try to do this as we don't get stack traces for _|_.

-- Felipe.

On Tue, Jan 26, 2010 at 11:55 AM, Felipe Lessa wrote:
> On Tue, Jan 26, 2010 at 11:52:34AM -0600, Jeremy Shaw wrote:
> > +    toConstr _ = error "toConstr"
> > +    gunfold _ _ = error "gunfold"
> Isn't it better to write
>     error "Data.Text.Text: toConstr"
> Usually I try to do this as we don't get stack traces for _|_.

I think so... none of the other instances do.. but I guess that is not a very good excuse :)

- jeremy

Attached.
Thanks!
- jeremy

On Sun, Jan 31, 2010 at 1:34 AM, Bryan O'Sullivan wrote:
> On Tue, Jan 26, 2010 at 10:08 AM, Jeremy Shaw wrote:
> > I think so... none of the other instances do.. but I guess that is not a very good excuse :)
> Send me a final darcs patch, and I'll apply it.

Hello,
I have attached a new version that should work with GHC 6.10, though I have
not tested it.
The older Data.Data uses mkNorepType instead of mkNoRepType. I just changed
the patch to use the older spelling. In GHC >= 6.12 this will issue a
warning that the old spelling has been deprecated. This seems like a
reasonable fix as long as text drops support for GHC 6.10 before mkNorepType
is completely removed from Data.Data (which may never happen?).
Here is the bug:
http://hackage.haskell.org/trac/ghc/ticket/2760
Also, this patch still won't work with GHC < 6.10, is that ok?
I also noticed in the containers package, there are #ifdefs around the Data
instances:
#if __GLASGOW_HASKELL__
...
#endif
Should I add that as well? Or is text only supported under GHC anyway?
- jeremy

On Tue, Feb 2, 2010 at 12:03 AM, Bryan O'Sullivan wrote:
> On Mon, Feb 1, 2010 at 12:08 PM, Jeremy Shaw wrote:
> > Attached.
>
>     Data/Text.hs:175:63: Module `Data.Data' does not export `mkNoRepType'
>
> Can you send a followup patch that works against GHC 6.10.4, please?
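
An alternative to using the deprecated spelling everywhere is to select the name with CPP, in the style of the #if blocks mentioned for containers. This is only a sketch: noRepType is a hypothetical helper name, and the version test assumes, as reported in this thread, that mkNoRepType is available from GHC 6.12 while GHC 6.10 only has mkNorepType.

```haskell
{-# LANGUAGE CPP #-}
import Data.Data

-- Hypothetical helper that hides the spelling change between GHC versions.
noRepType :: String -> DataType
#if __GLASGOW_HASKELL__ >= 612
-- Newer spelling, assumed available from GHC 6.12 on.
noRepType = mkNoRepType
#else
-- Older spelling, as exported by Data.Data in GHC 6.10.
noRepType = mkNorepType
#endif
```

The instance would then use dataTypeOf _ = noRepType "Data.Text.Text" and compile without a deprecation warning on either compiler, at the cost of requiring the CPP extension in that module.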

On Fri, Feb 5, 2010 at 9:33 AM, Jeremy Shaw wrote:
> I have attached a new version that should work with GHC 6.10, though I have not tested it.

Thanks. I fixed the compilation warning, added a Data instance for lazy Text, and released 0.7.1.0.

participants (9)
- Bryan O'Sullivan
- Derek Elkins
- Felipe Lessa
- Jeremy Shaw
- José Pedro Magalhães
- Lennart Augustsson
- Neil Brown
- Neil Mitchell
- Nicolas Pouillard