Diving into the records swamp (possible GSoC project)

Hi, I am hoping to do a GSoC project this year working on GHC, and have been pointed in the direction of the records issue (in particular, the desire to overload field names). This has been discussed on-and-off for years, and while there are lots of ideas [1], little has been implemented in GHC itself. The plan would be to implement a solution to the "narrow issue" of overloaded field names, along the lines of Simon PJ's SORF proposal (on the wiki). This would provide a basis for experimentation with first-class record types. While there are still design issues to resolve, the broad plan is clear and I'm confident it can be implemented in a summer and without overly restricting future record system designs. Does this sound like a reasonable strategy? I'd appreciate comments and criticism, although arguing about the details of the design should perhaps wait. (A little about me: I'm a PhD student working on type inference for Haskell and dependent types, with about four years' Haskell experience including work on big type-system related projects. I am familiar with the theory behind GHC, but I haven't worked on the code before.) Thanks, Adam Gundry [1] http://hackage.haskell.org/trac/ghc/wiki/Records

Adam Gundry
writes: Hi,
I am hoping to do a GSoC project this year working on GHC, and have been pointed in the direction of the records issue (in particular, the desire to overload field names).
Heck you're brave! Are you sure you want to step into the aggravated issue of changing the dot operator from being function composition? Are you going to use explicit type application? ("The type of get is very odd.") Are you going to handle type-changing update?
The plan would be to implement a solution to the "narrow issue" of overloaded field names, along the lines of Simon PJ's SORF proposal
So has <someone> decided that SORF is the best of those many proposals? I guess it's because it comes with the SPJ ring of confidence? Before jumping to that decision, I suggest you/your sponsors consider the implications of the "NewAxioms" stuff in GHC Head [2] to support 'controlled' overlap. I think overlap is the only extra feature needed to support the DORF or TPDORF proposals. (Plus the syntactic sugar already outlined in that proposal.)
This would provide a basis for experimentation with first-class record types.
No: first-class record types needs much more than the SORF proposal. In particular it needs a way to extend an existing record to make a new one; project a subset of fields; and most important to merge two records with some fields in common avoiding doubling-up those fields (aka Relational Natural Join). The DORF/TPDORF proposals are aimed much better as a step towards first- class record types. [IMO **] Oleg/Ralf's HList paper covers all the ground for first-class records. It depends heavily on overlaps, which is why the NewAxioms stuff would work in really well. AntC [2] http://hackage.haskell.org/trac/ghc/wiki/NewAxioms [**] Declaration of interest: I wrote the DORF and TPDORF proposals.

Slightly, off-topic, but just because I've been spending my last couple of days trying to shoehorn an inheritance-based subytping type system into Haskell (without full OO-power, so no methods or mutable state.)
Oleg/Ralf's HList paper covers all the ground for first-class records. It depends heavily on overlaps, which is why the NewAxioms stuff would work in really well.
I've been kicking around the idea of re-implementing HList on the basis of the new DataKinds [1] extension. I don't know if there'd be much of a need for that, though. More generally, I'm wondering what can be done on extensible records and inheritance-based type systems with FC-pro under one's belt… Here's one thing I don't like about the "current" way HList-based extensible record are represented (and used in OOHaskell [2]): the access time is linear in the number of records a certain type has. Somehow just the thought of "reorder the records in your constructors to make your program go faster" makes me cringe a little. Oleg and Ralf do hint at implementing some binary search tree based technique to alleviate the problem, but I have no idea how to go about it, except extending the compiler. ~ A. References [1] Yorgey, Brent A., et al. "Giving Haskell a promotion." Proceedings of the 8th ACM SIGPLAN workshop on Types in language design and implementation. ACM, 2012. [2] Kiselyov, Oleg, and Ralf Lämmel. "Haskell's overlooked object system." arXiv preprint cs/0509027 (2005).

Hi Aleksandar, This library for extensible records does use -XDataKinds: http://hackage.haskell.org/package/vinyl. It doesn't have as many definitions as HList, but that might be because more recent extensions are more powerful. Many other libraries are listed http://www.haskell.org/haskellwiki/Extensible_record#Libraries_on_hackage. I don't know of a comparison between all of them. -- Adam

Aleksandar Dimitrov
writes:
Hi Aleksandar, I was hoping that Oleg himself would answer the second part of your post, as he did the part re DataKinds:
Here's one thing I don't like about the "current" way HList-based extensible record are represented (and used in OOHaskell [2]): the access time is linear in the number of records a certain type has. Somehow just the thought of "reorder the records in your constructors to make your program go faster" makes me cringe a little.
Yes, it would me! I thought that usually the instance matching for HList happens at compile time(?) So reordering might make compiles faster, but that should be dealt with before run-time(?) I guess it matters whether the 'shape' of the HLists is knowable at compile time, or the records are built 'on the fly' at run time. Here's a possibly relevant idea that I would try for myself if only my day job didn't get in the way of my life so much ;-(. Instead of Type-Indexed Products (TIP's as the HList paper calls them), how about Type-Indexed tuples (TIple's) like this: instance Has (a, b) a where { get (x, _) = x; ... } instance Has (a, b) b ... instance Has (a, b, c) a ... ... instance Has (a, b, c, d) a ... ... (Note that the result type is not an argument to `get'; instead the type is 'pulled' out by the demanding context.) Then access to any element is 'flat' rather than having to walk down the spine of an HList. (Again this needs being able to resolve instances at compile time.) The instances for any n-tuple overlap, and the result of get/set is not confluent. This only matters if the same type appears twice in a tuple. (Contrast that HList uses a 'Lacks' pseudo-constraint/instance failure.) I hope Template Haskell would help with generating the instances for all of the n-tuples -- otherwise it's a lot of boilerplate. The tricky part comes with TIple-level combinations such as extend or append. That might be where NewAxioms overlaps come in to calculate the type of the result. AntC

Hi AntC, Thanks for the feedback! On 26/04/13 09:55, AntC wrote:
Adam Gundry
writes: Hi,
I am hoping to do a GSoC project this year working on GHC, and have been pointed in the direction of the records issue (in particular, the desire to overload field names).
Heck you're brave!
Or possibly stupid. ;-) That's what I'm trying to figure out. There are certainly other, less controversial, things I might work on! Your questions are the kind of detailed issue that I'm tempted to postpone for the time being, although I guess I can state my prejudices:
Are you sure you want to step into the aggravated issue of changing the dot operator from being function composition?
I'd prefer to leave dot alone and add a new operator for record field projection, mostly because I don't relish trying to modify the parser, but I am open to other opinions.
Are you going to use explicit type application? ("The type of get is very odd.")
GHC desperately needs explicit type application, but that's another syntactic minefield; I don't think it matters much for this proposal, in that it's an implementation detail rather than something that should be exposed to the user.
Are you going to handle type-changing update?
I think we need to, though the right solution might well be to punt for now and implement non-overloaded update alone (i.e. require the user to explicitly specify the record type being updated when there is ambiguity). This also works for multiple update, which is an issue otherwise.
The plan would be to implement a solution to the "narrow issue" of overloaded field names, along the lines of Simon PJ's SORF proposal
So has <someone> decided that SORF is the best of those many proposals? I guess it's because it comes with the SPJ ring of confidence?
I should have phrased my message more carefully. By "along the lines of" SORF I didn't mean to exclude adapting it to take into account your work on DORF/TPDORF. As I understand it, they both set out to solve the same basic problem (that Johan outlines so clearly) but differ in the implementation details. I rather meant to contrast the SORF/DORF approach with an approach based solely on name-spacing. Perhaps I should have mentioned that SPJ suggested I look at this, although I don't think we want to be dictatorial about the final result.
Before jumping to that decision, I suggest you/your sponsors consider the implications of the "NewAxioms" stuff in GHC Head [2] to support 'controlled' overlap.
I think overlap is the only extra feature needed to support the DORF or TPDORF proposals. (Plus the syntactic sugar already outlined in that proposal.)
This would provide a basis for experimentation with first-class record types.
No: first-class record types needs much more than the SORF proposal. In particular it needs a way to extend an existing record to make a new one; project a subset of fields; and most important to merge two records with some fields in common avoiding doubling-up those fields (aka Relational Natural Join).
The DORF/TPDORF proposals are aimed much better as a step towards first- class record types. [IMO **]
Oleg/Ralf's HList paper covers all the ground for first-class records. It depends heavily on overlaps, which is why the NewAxioms stuff would work in really well.
While the NewAxioms stuff looks interesting, it's at an early stage. Moreover, I'm worried that a records implementation based on desugaring to an encoding (in the style of HList) will inevitably reveal details of that encoding to the user. Petr rightly points out the need for good error messages: these are much easier to generate if the constraints arising from records are solved in GHC, rather than a library like HList. There is certainly a long way from this proposal to full-blown extensible records, and I don't expect to get there in a summer.
AntC
[2] http://hackage.haskell.org/trac/ghc/wiki/NewAxioms
[**] Declaration of interest: I wrote the DORF and TPDORF proposals.
Thanks again for your efforts. From reading the list archives I can see that a lot of thought and work has gone in to the different proposals, which definitely will inform the final design. It would be good if we could finally make progress on the implementation! Best regards, Adam

Hi Adam,
very nice idea. As the others, I'm curious why you chose to implement SORF
in favor of the other ideas?
I just read the SORF proposal, and I'm a bit concerned about what error
messages would GHC issue when someone would type incorrect code involving
such records. Currently Haskell's error messages already pose a barrier for
newcomers (like "No instance for (Num (a -> a))"), and if records are
converted into those very complicated `Has` instances, type errors would be
probably undecipherable even for moderate skilled Haskell users.
Considering that records are a basic feature of Haskell and something that
people with OOP background are familiar with, this could result in a
feature that would without doubts deter many (if not most) newcomers. So do
you think it would be possible to implement it in such a way that users get
sensible type error messages?
Best regards,
Petr
2013/4/26 Adam Gundry
Hi,
I am hoping to do a GSoC project this year working on GHC, and have been pointed in the direction of the records issue (in particular, the desire to overload field names). This has been discussed on-and-off for years, and while there are lots of ideas [1], little has been implemented in GHC itself.
The plan would be to implement a solution to the "narrow issue" of overloaded field names, along the lines of Simon PJ's SORF proposal (on the wiki). This would provide a basis for experimentation with first-class record types. While there are still design issues to resolve, the broad plan is clear and I'm confident it can be implemented in a summer and without overly restricting future record system designs.
Does this sound like a reasonable strategy? I'd appreciate comments and criticism, although arguing about the details of the design should perhaps wait.
(A little about me: I'm a PhD student working on type inference for Haskell and dependent types, with about four years' Haskell experience including work on big type-system related projects. I am familiar with the theory behind GHC, but I haven't worked on the code before.)
Thanks,
Adam Gundry
[1] http://hackage.haskell.org/trac/ghc/wiki/Records
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi Petr, On 26/04/13 19:53, Petr Pudlák wrote:
Hi Adam,
very nice idea. As the others, I'm curious why you chose to implement SORF in favor of the other ideas?
As I've commented in a message just now [1], by mentioning SORF I didn't mean to exclude taking on board the other proposals (particularly DORF/TPDORF). I do think a type-based approach is the way forward, with a new sort of constraint that gives a polymorphic type to record projection. The plans based on changing the name-spacing rules either need type-based resolution as well, or require field names to be prefixed with the relevant data type (so they don't fully solve the problem). Also, I'm a type theorist, so obviously the type-based solution is the best one. ;-)
I just read the SORF proposal, and I'm a bit concerned about what error messages would GHC issue when someone would type incorrect code involving such records. Currently Haskell's error messages already pose a barrier for newcomers (like "No instance for (Num (a -> a))"), and if records are converted into those very complicated `Has` instances, type errors would be probably undecipherable even for moderate skilled Haskell users. Considering that records are a basic feature of Haskell and something that people with OOP background are familiar with, this could result in a feature that would without doubts deter many (if not most) newcomers. So do you think it would be possible to implement it in such a way that users get sensible type error messages?
You're right to raise this as an issue. I think it is essential to get good error messages, or at least no worse than we have already! Rather than "No instance for Has r l t" we want it to say something like "Type r has no field l". This justifies special treatment of Has constraints by the compiler, rather than simply desugaring to an encoding. Constraint-based type inference means this shouldn't be too difficult. The current description of SORF doesn't really distinguish between the user's perspective and the implementer's. This is something that we should change (so it's clear how much users will need to understand); but perhaps I'll wait and see if my GSoC proposal is accepted first!
Best regards, Petr
Thanks for your comments, Adam [1] http://www.haskell.org/pipermail/haskell-cafe/2013-April/107876.html

Hi Adam, Since we have already had *very* long discussions on this topic, I'm worried that I might open a can of worms be weighing in here, but the issue is important enough to me that I will do so regardless. Instead of endorsing one of the listed proposals directly, I will emphasize the problem, so we don't lose sight of it. The problem people run into *in practice* and complain about in blog posts, on Google+, or privately when we chat about Haskell over beer, is that they would like to write a record definition like this one: data Employee = Employee { id :: Int, name :: String } printId :: Employee -> IO () printId emp = print $ id emp but since that doesn't work well in Haskell today due to name collisions, the best practice today is to instead write something like: data Employee = Employee { employeeId :: Int, employeeName :: String } printId :: Employee -> IO () printId emp = print $ employeeId emp The downsides of the latter have been discussed elsewhere, but briefly they are: * Overly verbose when there's no ambiguity. * Ad-hoc prefix is hard to predict (i.e. sometimes abbreviations of the data type name are used). The important requirement, which might seem a bit obvious, is that any solution to this problem better not be *even more* verbose than the second code snippet above. If I understand the SORF proposal correctly, you would write: data Employee = Employee { id :: Int, name :: String } printId :: Employee -> IO () printId emp = print $ emp.id Is that correct or do you have to replace 'Employee' with 'r { id :: Int }' in the type signature of 'printId'? The discussions about an overhauled record system also involve lots of talk about record sub-typing, extensible records, and other more advanced features. I'd like to point out that there doesn't seem to be a great demand for these features. They might be nice-to-haves or might fall out naturally from a solution to the namespacing problem above, but they are in fact not needed to solve the common problem people have with the Haskell record system. Cheers, Johan

Hi Johan, On 26/04/13 20:46, Johan Tibell wrote:
Hi Adam,
Since we have already had *very* long discussions on this topic, I'm worried that I might open a can of worms be weighing in here, but the issue is important enough to me that I will do so regardless.
I'm the one busily opening this particular can. It's good to know it's an important one though! Thanks for characterising the problem so neatly:
Instead of endorsing one of the listed proposals directly, I will emphasize the problem, so we don't lose sight of it. The problem people run into *in practice* and complain about in blog posts, on Google+, or privately when we chat about Haskell over beer, is that they would like to write a record definition like this one:
data Employee = Employee { id :: Int, name :: String }
printId :: Employee -> IO () printId emp = print $ id emp
but since that doesn't work well in Haskell today due to name collisions, the best practice today is to instead write something like:
data Employee = Employee { employeeId :: Int, employeeName :: String }
printId :: Employee -> IO () printId emp = print $ employeeId emp
The downsides of the latter have been discussed elsewhere, but briefly they are:
* Overly verbose when there's no ambiguity. * Ad-hoc prefix is hard to predict (i.e. sometimes abbreviations of the data type name are used).
The important requirement, which might seem a bit obvious, is that any solution to this problem better not be *even more* verbose than the second code snippet above. If I understand the SORF proposal correctly, you would write:
data Employee = Employee { id :: Int, name :: String }
printId :: Employee -> IO () printId emp = print $ emp.id http://emp.id
Is that correct or do you have to replace 'Employee' with 'r { id :: Int }' in the type signature of 'printId'?
That's correct. The most general type (inferred if the annotation is omitted) will be something like printId :: r { id :: Int } => r -> IO () but you are free to declare a more specific type in the usual way, much as if the constraint was 'Show r', say.
The discussions about an overhauled record system also involve lots of talk about record sub-typing, extensible records, and other more advanced features. I'd like to point out that there doesn't seem to be a great demand for these features. They might be nice-to-haves or might fall out naturally from a solution to the namespacing problem above, but they are in fact not needed to solve the common problem people have with the Haskell record system.
Thanks, I take your point. My proposal is to implement a good solution to the problem you've outlined; I don't think we should go all the way to extensible records just yet, if at all.
Cheers, Johan
All the best, Adam

Johan Tibell
writes: Instead of endorsing one of the listed proposals directly, I will emphasize the problem, so we don't lose sight of it. The problem people run into *in practice* and complain about in blog posts, on Google+, or privately when we chat about Haskell over beer, is that they would like to write a record definition like this one:
data Employee = Employee { id :: Int, name :: String }
printId :: Employee -> IO () printId emp = print $ id emp
but since that doesn't work well in Haskell today due to name collisions, ...
... the best practice today is to instead write something like:
data Employee = Employee { employeeId :: Int, employeeName :: String }
printId :: Employee -> IO () printId emp = print $ employeeId emp
The downsides of the latter have been discussed elsewhere, but briefly
[I've a bit more to say on that record definition below.] Thank you Johan, I agree we should keep clear sight of the problem. So let's be a bit more precise: it's not exactly the record declaration that causes the name collisions, it's the field selector function that gets created automatically. (Note that we can use xDisambiguateRecordFields to access fields to, errm, disambiguate.) So I did put in a separate proposal [3] (and ticket) on that very narrow issue. (Simon M pointed out that I probably didn't name it very well!) Even if we do nothing to advance the "records swamp", PLEASE can we provide a compiler option to suppress that function. I envisage it might facilitate a 'cottage industry' of Template Haskell solutions (generating Has instances), which would be a cheap and cheerful way to experiment in the design space. [3] http://hackage.haskell.org/trac/ghc/wiki/Records/DeclaredOverloadedRecordFi elds/NoMonoRecordFields (There are bound to be some fishhooks, especially around export/import of names from a module with no selector functions to one that's expecting them.) [cont from above] they are:
* Overly verbose when there's no ambiguity. * Ad-hoc prefix is hard to predict (i.e. sometimes abbreviations of the
data type name are used). I don't entirely agree with your analysis. * fields named `id' or `name' are very likely to clash, so that's a bad design (_too_ generic). * If you've normalised your data model [**], you are very likely to want exactly the same field in several records (for example employeeId in EmployeeNameAddress, and in EmployeePay and in EmployeeTimeSheet.) [And this use case is what TP/DORF is primarily aimed at.] [**] Do I need to explain what data model normalisation is? I fear that so- called XML 'databases' mean academics don't get taught normalisation any more(?) AntC

Johan Tibell
writes: The discussions about an overhauled record system also involve lots of talk about record sub-typing, extensible records, and other more advanced features. I'd like to point out that there doesn't seem to be a great demand for these features. ...
Sorry, Johan, I really have to disagree with that. There's lot's of Haskell to SQL interfaces that build on HList and its extensible record ideas (HDBC for example). But the usability is not good (as Petr points out, and as Oleg/Ralf admitted back in the paper). The type error messages are long and obscure.
... They might be nice-to-haves or might fall out naturally from a solution to the namespacing problem above, but they are in fact not needed to solve the common problem people have with the Haskell record system.
"the common problem people have" is that the record system is unusable [IMO] so doesn't get 'stretched' to see what other difficulties it has. There are all sorts of alternative systems (including Lenses) built with Template Haskell (and chewing gum and gaffer tape: that's how desperately bad is the current situation ;-). I'm saying that many people find the Haskell record system 'as is' so dysfunctional that they give up on it! I feel strongly that as soon as we get past the name collissions, there'll be other blockages to using it. I'd be interested to hear if there are any who can remember the Trex system, and how (un)usable it was? AntC
participants (6)
-
Adam Gundry
-
adam vogt
-
Aleksandar Dimitrov
-
AntC
-
Johan Tibell
-
Petr Pudlák