A backwards-compatible record proposal

Hi - As I've been writing a Haskell program over the past few months the main problem I encounter is that record field names are not local to the record type, and any systematic way of making them local (eg by prepending "_Tycon_") results in names that are just too clunky, and I feel that identifiers *must* use systematic naming conventions to get code that will be easy to understand and maintain. Although it is relatively easy to think up a better record system, it has taken me up till now to discover how such a system could be integrated with Haskell as it is at the moment, since it's unlikely that the existing record system will ever disappear, at least not in the next few years, and in fact, there are some good points about the existing record system that I wouldn't like to lose. I'll motivate the proposal then talk about how it could be implemented. Motivation ======= Consider the following: data Vector3 a = Vector3{x :: a, y::a, z::a } data Normal3 a = Normal3{x :: a, y::a, z::a } We've got a problem because (x) has been introduced twice to the top level namespace. In the above example, it could be argued that I should have written: data Arr3 a = Arr3 {x::a, y::a, z::a} newtype Vector3 a = Vector3 (Arr3 a) newtype Normal3 a = Normal3 (Arr3 a) but for the sake of argument, let's suppose we can't do this - perhaps one of the record types has some different fields as well. A related problem is suppose I have: data Size = {width :: Int, height :: Int} data Rect = Rect{x1, y1, x2, y2::Int} width :: Rect -> Int width Rect{x1,_,x2,_} = x2 - x1 because a record field of Size has the same name as a top level function. This second conflict can be avoided by always using the rule that record fields begin with an underscore and all other variables don't. So applying this to the first problem, we have: data Vector3 a = Vector3{_x :: a, _y::a, _z::a } data Normal3 a = Normal3{_x :: a, _y::a, _z::a } Of course we haven't solved it yet! But now suppose we introduce a new piece of syntactic sugar, and write: data Vector3 a = Vector3{.x :: a, .y::a, .z::a } data Normal3 a = Normal3{.x :: a, .y::a, .z::a } ie putting a '.' before each field name. The intended meaning is that dotted field names do *not* generate top level functions. Instead they allow the compiler to generate instance decls as follows, where we've introduced a new form of identifier, the dotted id, which behaves as a postfix operator which binds more tightly than function application and can also be used as a class name (by the compiler only): class (.x) :: a b | a -> b where (.x) :: a -> b class (.y) :: a b | a -> b where (.y) :: a -> b class (.z) :: a b | a -> b where (.z) :: a -> b For each dotted id, there is a class defined as above, which is available globally to the whole program as if a module containing an infinite set of class decls as above was exported by the Prelude. In the module containing the data decl for the record, the compiler inserts the following: instance (.x) (Vector3 a) a where (.x) v = ... -- compiler generated code to access the field Then within the rest of the program we can write: magSquared :: Num a => Vector3 a -> a magSquared v = v.x*v.x + v.y*v.y + v.z*v.z -- explicit type when specific function is required vec_x = ((.x) :: (Vector3 a -> a)) The advantage of this proposal is that it is completely backwards compatible with records as they are at the moment, and we can choose which fields we want to be dotted and which we want to just keep as normal top level functions. The only extra thing we need to do is put a dot before the field names we want to access via the dotted syntax, but we in any case needed to use an underscore when we wanted a systematic way to avoid conflicts between field names and other top-level names so there is no extra effort involved. Implementation ========== The above could almost be implemented just by parsing a source file containing uses of dotted fields and using a conversion like: data Rec a = Rec { .f :: a} let rec = Rec {.f = 78} -- dot is used here too p = rec.f * rec.f ==> data Rec a = Rec a instance Dot__f (Rec a) a where __dot_f (Rec a) = a let rec = Rec 78 p = __dot_f rec * __dot_f rec So far so good, but the alert reader :-) will have noticed that we now have a *major* problem with abstraction because although we can write: module M (Rec) where ... code in another module can still say rec.f because in Haskell, all instance decls in a module are always exported [1]. This problem may disappear in Haskell' [2]. http://hackage.haskell.org/trac/haskell-prime/ticket/19 Therefore I think the desugaring would need to take place in the compiler so the compiler could avoid exporting the compiler-generated instances when the fields are not present in the module export list. Regards, Brian. [1] http://haskell.org/onlinereport/modules.html#import-instances [2] http://hackage.haskell.org/trac/haskell-prime/ticket/19 -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

On Sat, Aug 19, 2006 at 09:21:34AM +0100, Brian Hulley wrote:
Therefore I think the desugaring would need to take place in the compiler so the compiler could avoid exporting the compiler-generated instances when the fields are not present in the module export list.
I'm not entirely sure I understand you here, but something to consider is how well the sytem can be handled by something which is not a compiler, but a code transformer. Something like Hat or buddha. For tools like these, it is important that a desugared program is still a valid (source level) program. (Haskell 98 is close to this, but not 100% - unfortunately). Desugaring sometimes introduces new code into a program (for instance the deriving rules), a program transforming tool will most likely have to apply its transformation to that introduced code. Therefore, to transform a Haskell program you have to desugar it (somewhat) first. We want the result to remain a valid Haskell program, so it can be accepted by an ordinary compiler. (Please forgive me if your scheme already allows this). Cheers, Bernie.

Bernard James POPE wrote:
On Sat, Aug 19, 2006 at 09:21:34AM +0100, Brian Hulley wrote:
Therefore I think the desugaring would need to take place in the compiler so the compiler could avoid exporting the compiler-generated instances when the fields are not present in the module export list.
I'm not entirely sure I understand you here,
For example, module M (Rec) where data Rec a = Rec {f :: a} means that the components of Rec are visible in module M but not from any other module ie Rec is an abstract data type. However if a dotted field was used, as in: module M (Rec) where data Rec a = Rec {.f :: a} we'd also like the fields of Rec to be inaccessible from outside, but since (.f) is a global typeclass, and module M contains an instance of (.f) (Rec a) a, and since any module which imports M always sees all instances that M sees, in particular it would see the instance (.f) (Rec a) a and therefore be able to inspect values of (Rec a) thus breaking the module abstraction barrier. The problem is that the inability to prevent instances from being exported from a module breaks the abstraction, so all though everything in the proposal can be desugared into plain Haskell (with MPTC/FD) the caveat is that we would lose abstraction with this simple method. However I think it could be solved by a more complex desugaring: module M (Rec, use) where data Rec a = Rec {f :: a} use :: Rec a -> (a,a) use r = (r.f, r.f) by introducing a newtype and adding wrappers to functions as follows: module M (Rec, use) where import DotClasses.Dot_f -- every class has its own module (*) data Rec' a = Rec a newtype Rec a = Rec (Rec' a) instance Dot__f (Rec' a) a where __dot_f (Rec' x) = x use :: Rec a -> (a,a) use (Rec r) = use' r use' :: Rec' a -> (a,a) use' r = (r.f, r.f) All code in the source module that uses Rec is renamed to syntactically isomorphic code using Rec' instead, and exported functions are replaced by wrappers to renamed versions of the original functions. Any code outside the module can only see Rec and not Rec' so the abstraction is preserved.
but something to consider is how well the sytem can be handled by something which is not a compiler, but a code transformer. Something like Hat or buddha.
For tools like these, it is important that a desugared program is still a valid (source level) program. (Haskell 98 is close to this, but not 100% - unfortunately).
Desugaring sometimes introduces new code into a program (for instance the deriving rules), a program transforming tool will most likely have to apply its transformation to that introduced code. Therefore, to transform a Haskell program you have to desugar it (somewhat) first. We want the result to remain a valid Haskell program, so it can be accepted by an ordinary compiler.
With the more complex desugaring described above the result would be ordinary Haskell code (though not H98 due to MPTC and FD for the Dot_* classes). (*) Since it might be unwieldy to try to build a single module containing all the dot classes used in an entire program and all libraries, a simple solution is to just have a separate directory to store dot classes then the rule would be when the desugaring tool encounters a dotted field in a data decl, it would check to see if the module DotClasses.Dot_f already exists, and if it doesn't, then it would create the module: module DotClasses.Dot_f where class Dot_f a b | a -> b __dot_f :: a -> b Each dotted field that appears in any code in a module would give rise to an import DotClasses.Dot_ directive. (If the module used libraries that were not also desugared then the "ensure Dot_ module created" operation would have to be done for all dotted fields used anywhere in the module, not just in data decls.) Thus there appears to be a fairly straightforward algorithm for desugaring to code that could be readily accepted without needing to change the compiler (or other tools) after all. Of course the usefulness of error messages etc would be improved if the tools could deal with unsugared source. Best regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Ooops! ;-) Brian Hulley wrote:
module M (Rec, use) where
import DotClasses.Dot_f -- every class has its own module (*)
data Rec' a = Rec a newtype Rec a = Rec (Rec' a)
instance Dot__f (Rec' a) a where
instance Dot_f (Rec' a) a where
__dot_f (Rec' x) = x
use :: Rec a -> (a,a) use (Rec r) = use' r
use' :: Rec' a -> (a,a) use' r = (r.f, r.f)
use' r = (__dot_f r, __dot_f r)

Brian Hulley wrote:
However I think it could be solved by a more complex desugaring:
The proposed desugarings allow us to either make all dotted fields in a record visible, or none of them visible, but I don't think there exists a desugaring that would allow some to be visible while others were hidden. However this wouldn't be a problem because we could just have the rule that if people want to use dotted fields, they must either export all the fields in the record or no fields in the record (including non-dotted fields). This restriction could later be relaxed when tools/compilers etc implemented dotted fields directly. I don't think it's that important as I've never encountered a situation where I wanted to make only part of a record visible - I suppose I'm just an "all or nothing" person ;-) Apologies for the multiple posts, Brian.

On 8/19/06, Brian Hulley

Gene A wrote:
On 8/19/06, Brian Hulley
wrote: {... magSquared v = v.x*v.x + v.y*v.y + v.z*v.z ...} Hi, Won't the use of the "dot" lend confusion to the eye of the beholder.. that as in the code fragment about that v.y or v.z is implying function composition .... I'll admit to being pretty new to Haskell, but that is what it would look like to me. Could, I think cause some confusion to others reading a program with this construct, and might fool some of the tools that some others have mentioned.. Would seem that parser could or would make that mistake?
Hi Gene, In v.x or v .x then ".x" is a single lexeme, whereas in v . x or v. x the "v" and "x" are ids and the "." is a symbol. In other words, the parser sees: v.x [VarId, DottedField] v .x [VarId, DottedField] v. x [VarId, VarSym, VarId] v . x [VarId, VarSym, VarId] This works because the lexer just obeys the "maximal munch" rule ie reading from left to right eating up as many characters as possible to form each lexeme. It's probably slightly confusing when seen as plain text, but if you used an editor that fontifies VarId's differently from DottedField's, the difference would be easily visible. I think it would also become quite natural just as we already have: 123.42 vs 123 . 42 A.B.C.p vs A.B.C . p As you pointed out "f.g" at the moment means function composition. However there seems afaics to be an informal convention that spaces are always placed around the dot when used as an operator, since of course "F.g" means "the g in module F" as opposed to "F . g". Therefore my proposal is not entirely backwards compatible, though there could perhaps be a compiler flag to prevent old code from being broken, or a tool to insert the required spaces into old code. Best regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Hello Brian, Saturday, August 19, 2006, 12:21:34 PM, you wrote:
ie putting a '.' before each field name. The intended meaning is that dotted field names do *not* generate top level functions. Instead they allow the compiler to generate instance decls as follows, where we've introduced a new form of identifier, the dotted id, which behaves as a postfix operator which binds more tightly than function application and can also be used as a class name (by the compiler only):
class (.x) :: a b | a -> b where (.x) :: a -> b
this means that foo.bar should be parsed differently by _lexer_ depending on is there any .bar field available in current or any imported module How about omitting '.' and using '#' operation for record access? the only problem will that '#' should have larger priority than ' '. i.e. function application. I had proposal on it, but it's too daring - raise priority of operations when they are written without spaces around, i.e. object#call x+y t#field z*2+1 treated as (object#call) (x+y) (t#field) (z*2+1) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Brian,
Saturday, August 19, 2006, 12:21:34 PM, you wrote:
ie putting a '.' before each field name. The intended meaning is that dotted field names do *not* generate top level functions. Instead they allow the compiler to generate instance decls as follows, where we've introduced a new form of identifier, the dotted id, which behaves as a postfix operator which binds more tightly than function application and can also be used as a class name (by the compiler only):
class (.x) :: a b | a -> b where (.x) :: a -> b
this means that foo.bar should be parsed differently by _lexer_ depending on is there any .bar field available in current or any imported module
Hi Bulat, I just assumed that foo.bar would always be lexed as "foo" ".bar" so if composition was intended there would need to be a space between the "." and the "bar", but this breaks backwards compatibility (I hadn't thought of this when I made the proposal because I assumed everyone always writes "f . g" and not "f.g" when they mean composition but of course the latter is currently parsed as composition also).
How about omitting '.' and using '#' operation for record access? the only problem will that '#' should have larger priority than ' '. i.e. function application. I had proposal on it, but it's too daring - raise priority of operations when they are written without spaces around, i.e.
object#call x+y t#field z*2+1
treated as
(object#call) (x+y) (t#field) (z*2+1)
Intuitively this makes sense, because you'd expect no spaces to mean "glue this together as tight as possible", but on the other hand I personally find the second version with brackets is visually much easier to read even though it's a bit more inconvenient to write. But 1) The first version is not backwards compatible either. 2) object#call should not be treated as (object # call) because it means ((#call) object) so whatever symbol is used for field selection, it needs to be glued onto the field name by the lexer not the parser. Also I was hoping to be able to use the "." so that field access would follow the C, C#, Java conventions and because # is used by ghc to mean "unboxed" so the following might be confusing: object#call 1# x#y # k (#g,a#) object.call 1# x.y # k (#g,a#) ie ((object.call) 1# (x.y)) # (k (# g, a #)) Best regards, Brian -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com
participants (4)
-
Bernard James POPE
-
Brian Hulley
-
Bulat Ziganshin
-
Gene A