Re: [Haskell-cafe] Call for discussion: OverloadedLists extension

Hi Michael, Here at the University of Tübingen, I am co-supervising (together with Jeroen Weijers) a student project implementing the OverloadedLists extension for GHC. Achim Krause is the student who is working on the project. We took into consideration earlier discussions on this topic [1,2] before embarking on the project. Achim has worked on two approaches. The first approach is very simple, both from the user's and the extension implementor's perspective (it follows the implementation of OverloadedStrings closely) and typechecks and desugars lists like [] ; [x,y,z] ; ['a' .. 'z'] ; as fromList [] ; fromList [x,y,z] ; fromList ['a' .. 'z'] ; where fromList is whatever is in scope with that name. That said, we do provide the FromList type class that can be used to overload fromList. In the following I give the definition of the class, as well as, example instances: class FromList l where type Item l fromList :: [Item l] -> l instance FromList [a] where type Item [a] = a fromList = id instance (Ord a) => FromList (Set a) where type Item (Set a) = a fromList = Set.fromList instance (Ord k) => FromList (Map k v) where type Item (Map k v) = (k,v) fromList = Map.fromList instance FromList (IntMap v) where type Item (IntMap v) = (Int,v) fromList = IntMap.fromList instance FromList Text where type Item Text = Char fromList = Text.pack This approach has already been implemented by Achim as patch against GHC head. This approach is very simple, but can be inefficient as it may result into unnecessary construction of lists at runtime. This can be a serious issue when constructing large structures from arithmetic sequences (e.g., from the [ .. ] notation) or when using non-literal expressions (e.g., variables) inside the square brackets. Our second approach to OverloadedLists is to avoid the construction of lists altogether. By typechecking and desugaring lists like [] ; [x,y,z] ; ['a' .. 'z'] ; as mempty ; singleton x `mappend` singleton y `mappend` singleton z ; genericEnumFromTo 'a' 'z' ; We provide the Singleton and GenericEnum type classes for overloading singleton and genericEnum(..) functions. In the following, I give the definitions of the classes, as well as, example instances: -- Singleton class class Singleton l where type SingletonItem l singleton :: SingletonItem l -> l -- Singleton instances instance Singleton [a] where type SingletonItem [a] = a singleton a = [a] instance (Ord a) => Singleton (Set a) where type SingletonItem (Set a) = a singleton = Set.singleton instance (Ord k) => Singleton (Map k v) where type SingletonItem (Map k v) = (k,v) singleton (k,v) = Map.singleton k v instance Singleton (IntMap v) where type SingletonItem (IntMap v) = (Int,v) singleton (k,v) = IntMap.singleton k v instance Singleton Text where type SingletonItem Text = Char singleton = Text.singleton -- GenericEnum class class GenericEnum l where type EnumItem l genericEnumFrom :: EnumItem l -> l genericEnumFromThen :: EnumItem l -> EnumItem l -> l genericEnumFromTo :: EnumItem l -> EnumItem l -> l genericEnumFromThenTo :: EnumItem l -> EnumItem l -> EnumItem l -> l -- GenericEnum instances instance (Enum a) => GenericEnum [a] where type EnumItem [a] = a genericEnumFrom = enumFrom genericEnumFromThen = enumFromThen genericEnumFromTo = enumFromTo genericEnumFromThenTo = enumFromThenTo instance (Ord a,Enum a) => GenericEnum (Set a) where type EnumItem (Set a) = a genericEnumFrom a = Set.fromList (enumFrom a) genericEnumFromThen a b = Set.fromList (enumFromThen a b) genericEnumFromTo a b = Set.fromList (enumFromTo a b) genericEnumFromThenTo a b c = Set.fromList (enumFromThenTo a b c) instance (Ord k,Enum (k,v)) => GenericEnum (Map k v) where type EnumItem (Map k v) = (k,v) genericEnumFrom a = Map.fromList (enumFrom a) genericEnumFromThen a b = Map.fromList (enumFromThen a b) genericEnumFromTo a b = Map.fromList (enumFromTo a b) genericEnumFromThenTo a b c = Map.fromList (enumFromThenTo a b c) instance (Enum (Int,v)) => GenericEnum (IntMap v) where type EnumItem (IntMap v) = (Int,v) genericEnumFrom a = IntMap.fromList (enumFrom a) genericEnumFromThen a b = IntMap.fromList (enumFromThen a b) genericEnumFromTo a b = IntMap.fromList (enumFromTo a b) genericEnumFromThenTo a b c = IntMap.fromList (enumFromThenTo a b c) instance GenericEnum Text where type EnumItem Text = Char genericEnumFrom a = Text.pack (enumFrom a) genericEnumFromThen a b = Text.pack (enumFromThen a b) genericEnumFromTo a b = Text.pack (enumFromTo a b) genericEnumFromThenTo a b c = Text.pack (enumFromThenTo a b c) Note that the GenericEnum instances can be implemented more efficiently, but for now I give simple definitions that go through lists. Our second approach avoids the construction of intermediate lists at runtime and directly constructs the target data structure for which the list notation is used. We will release GHC patches for both approaches, meanwhile the feedback from the community on the approaches that we took would be very much appreciated. Which one those would you prefer? or would you suggest a different one. Note that we intend to make fromList in the first approach and singleton, genericEnum(..), mempty and mapped rebindable. This means that the definitions of the type classes that overload this functions can be easily changed. Having said that, altering the changes that Achim already made to the GHC source code (including typechecking and desugaring rules) will be more work and we hope that one of the approaches that we took will be acceptable for the community. Cheers, George [1] http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg20447.html [2] http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg20518.html

On Mon, Sep 24, 2012 at 2:53 PM, George Giorgidze
Hi Michael,
Here at the University of Tübingen, I am co-supervising (together with Jeroen Weijers) a student project implementing the OverloadedLists extension for GHC. Achim Krause is the student who is working on the project. We took into consideration earlier discussions on this topic [1,2] before embarking on the project.
Achim has worked on two approaches.
The first approach is very simple, both from the user's and the extension implementor's perspective (it follows the implementation of OverloadedStrings closely) and typechecks and desugars lists like
[] ; [x,y,z] ; ['a' .. 'z'] ;
as
fromList [] ; fromList [x,y,z] ; fromList ['a' .. 'z'] ;
where fromList is whatever is in scope with that name. That said, we do provide the FromList type class that can be used to overload fromList. In the following I give the definition of the class, as well as, example instances:
class FromList l where type Item l fromList :: [Item l] -> l
instance FromList [a] where type Item [a] = a fromList = id
instance (Ord a) => FromList (Set a) where type Item (Set a) = a fromList = Set.fromList
instance (Ord k) => FromList (Map k v) where type Item (Map k v) = (k,v) fromList = Map.fromList
instance FromList (IntMap v) where type Item (IntMap v) = (Int,v) fromList = IntMap.fromList
instance FromList Text where type Item Text = Char fromList = Text.pack
This approach has already been implemented by Achim as patch against GHC head.
This approach is very simple, but can be inefficient as it may result into unnecessary construction of lists at runtime. This can be a serious issue when constructing large structures from arithmetic sequences (e.g., from the [ .. ] notation) or when using non-literal expressions (e.g., variables) inside the square brackets.
Our second approach to OverloadedLists is to avoid the construction of lists altogether. By typechecking and desugaring lists like
[] ; [x,y,z] ; ['a' .. 'z'] ;
as
mempty ; singleton x `mappend` singleton y `mappend` singleton z ; genericEnumFromTo 'a' 'z' ;
We provide the Singleton and GenericEnum type classes for overloading singleton and genericEnum(..) functions. In the following, I give the definitions of the classes, as well as, example instances:
-- Singleton class
class Singleton l where type SingletonItem l singleton :: SingletonItem l -> l
-- Singleton instances
instance Singleton [a] where type SingletonItem [a] = a singleton a = [a]
instance (Ord a) => Singleton (Set a) where type SingletonItem (Set a) = a singleton = Set.singleton
instance (Ord k) => Singleton (Map k v) where type SingletonItem (Map k v) = (k,v) singleton (k,v) = Map.singleton k v
instance Singleton (IntMap v) where type SingletonItem (IntMap v) = (Int,v) singleton (k,v) = IntMap.singleton k v
instance Singleton Text where type SingletonItem Text = Char singleton = Text.singleton
-- GenericEnum class
class GenericEnum l where type EnumItem l genericEnumFrom :: EnumItem l -> l genericEnumFromThen :: EnumItem l -> EnumItem l -> l genericEnumFromTo :: EnumItem l -> EnumItem l -> l genericEnumFromThenTo :: EnumItem l -> EnumItem l -> EnumItem l -> l
-- GenericEnum instances
instance (Enum a) => GenericEnum [a] where type EnumItem [a] = a genericEnumFrom = enumFrom genericEnumFromThen = enumFromThen genericEnumFromTo = enumFromTo genericEnumFromThenTo = enumFromThenTo
instance (Ord a,Enum a) => GenericEnum (Set a) where type EnumItem (Set a) = a genericEnumFrom a = Set.fromList (enumFrom a) genericEnumFromThen a b = Set.fromList (enumFromThen a b) genericEnumFromTo a b = Set.fromList (enumFromTo a b) genericEnumFromThenTo a b c = Set.fromList (enumFromThenTo a b c)
instance (Ord k,Enum (k,v)) => GenericEnum (Map k v) where type EnumItem (Map k v) = (k,v) genericEnumFrom a = Map.fromList (enumFrom a) genericEnumFromThen a b = Map.fromList (enumFromThen a b) genericEnumFromTo a b = Map.fromList (enumFromTo a b) genericEnumFromThenTo a b c = Map.fromList (enumFromThenTo a b c)
instance (Enum (Int,v)) => GenericEnum (IntMap v) where type EnumItem (IntMap v) = (Int,v) genericEnumFrom a = IntMap.fromList (enumFrom a) genericEnumFromThen a b = IntMap.fromList (enumFromThen a b) genericEnumFromTo a b = IntMap.fromList (enumFromTo a b) genericEnumFromThenTo a b c = IntMap.fromList (enumFromThenTo a b c)
instance GenericEnum Text where type EnumItem Text = Char genericEnumFrom a = Text.pack (enumFrom a) genericEnumFromThen a b = Text.pack (enumFromThen a b) genericEnumFromTo a b = Text.pack (enumFromTo a b) genericEnumFromThenTo a b c = Text.pack (enumFromThenTo a b c)
Note that the GenericEnum instances can be implemented more efficiently, but for now I give simple definitions that go through lists.
Our second approach avoids the construction of intermediate lists at runtime and directly constructs the target data structure for which the list notation is used.
We will release GHC patches for both approaches, meanwhile the feedback from the community on the approaches that we took would be very much appreciated. Which one those would you prefer? or would you suggest a different one.
Note that we intend to make fromList in the first approach and singleton, genericEnum(..), mempty and mapped rebindable. This means that the definitions of the type classes that overload this functions can be easily changed. Having said that, altering the changes that Achim already made to the GHC source code (including typechecking and desugaring rules) will be more work and we hope that one of the approaches that we took will be acceptable for the community.
Cheers, George
[1] http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg20447.html [2] http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg20518.html
Hi George, It's very exciting to hear that this work has already been starting, thank you for letting me know. Your first approach is more inline with my initial proposal, though that doesn't mean I necessarily prefer it. I'm certainly interested having a more efficient implementation available, but I wonder if this second approach isn't a premature optimization. GHC rewrite rules provide a lot of power in this department: both ByteString and Text are able to avoid the intermediate String and build their representations from the buffers GHC creates. In my own work in conduit, I was able (with some help from Joachim Breitner[1]) to remove the intermediate list structure, and I'd be surprised if it's not possible to do the same thing with calls to `fromList`. That said, I haven't actually tried implementing any of this in real code, and it could be that there are serious performance advantages to the second approach. I'm quite happy with either implementation making it into GHC. Michael [1] http://www.haskell.org/pipermail/haskell-cafe/2012-April/100793.html

| Here at the University of Tübingen, I am co-supervising (together with | Jeroen Weijers) a student project implementing the OverloadedLists | extension for GHC. Achim Krause is the student who is working on the | project. We took into consideration earlier discussions on this topic | [1,2] before embarking on the project. | | Achim has worked on two approaches. Your second approach is this: | [x,y,z] | | as | | singleton x `mappend` singleton y `mappend` singleton z ; This approach is not good for long literal lists, because you get tons of executable code where the user thought he was just defining a data structure. And long literal lists are an important use-case. One other possibility is to use a variant of what GHC does for literal strings. Currently "foo" turns into unpackCString "foo"# where "foo"# is a statically allocate C string, and the "unpackCString" unpacks it lazily. Maybe we could make a literal [a,b.c] turn into unpack [a,b,c]# where [a,b,c]# is a statically-allocated vector? See http://hackage.haskell.org/trac/ghc/ticket/5218, which is stalled awaiting brain cycles from someone. I'm maxed out at the moment. I'd be very happy if you guys were able to make progress; I'm happy to advise. Open a ticket, start a wiki page, etc! Simon

So, in order not to have to rely on rewrite rules, would it be a good idea to add unpackCString to the IsString class?
import GHC.Base (unpackCString#, Addr#)
class IsString a where
fromString :: String -> a
unpackCString :: Addr# -> a
unpackCString addr = fromString (unpackCString# addr)
For lists something similar could probably be done.
Sjoerd
On Sep 25, 2012, at 10:01 AM, Simon Peyton-Jones
| Here at the University of Tübingen, I am co-supervising (together with | Jeroen Weijers) a student project implementing the OverloadedLists | extension for GHC. Achim Krause is the student who is working on the | project. We took into consideration earlier discussions on this topic | [1,2] before embarking on the project. | | Achim has worked on two approaches.
Your second approach is this:
| [x,y,z] | | as | | singleton x `mappend` singleton y `mappend` singleton z ;
This approach is not good for long literal lists, because you get tons of executable code where the user thought he was just defining a data structure. And long literal lists are an important use-case.
One other possibility is to use a variant of what GHC does for literal strings. Currently "foo" turns into unpackCString "foo"# where "foo"# is a statically allocate C string, and the "unpackCString" unpacks it lazily.
Maybe we could make a literal [a,b.c] turn into unpack [a,b,c]# where [a,b,c]# is a statically-allocated vector? See http://hackage.haskell.org/trac/ghc/ticket/5218, which is stalled awaiting brain cycles from someone.
I'm maxed out at the moment. I'd be very happy if you guys were able to make progress; I'm happy to advise. Open a ticket, start a wiki page, etc!
Simon
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 9/25/12 1:57 PM, Sjoerd Visscher wrote:
Maybe we could make a literal [a,b,c] turn into unpack [a,b,c]# where [a,b,c]# is a statically-allocated vector?
I'm kinda surprised this isn't already being done. Just doing this seems like it'd be a good undertaking, regardless of whether we get overloaded list literals. Just storing the literal as a C-like array and inflating it to a list/array/vector at runtime seems like it should be a big win for code that uses a lot of literals. -- Live well, ~wren

On 12-09-26 08:07 PM, wren ng thornton wrote:
On 9/25/12 1:57 PM, Sjoerd Visscher wrote:
Maybe we could make a literal [a,b,c] turn into unpack [a,b,c]# where [a,b,c]# is a statically-allocated vector?
I'm kinda surprised this isn't already being done. Just doing this seems like it'd be a good undertaking, regardless of whether we get overloaded list literals. Just storing the literal as a C-like array and inflating it to a list/array/vector at runtime seems like it should be a big win for code that uses a lot of literals.
Why? I'm surprised that this is an issue at all. If list literals you are talking about are constant, wouldn't GHC apply constant folding and construct the list only the first time it's needed?

On 9/28/12 2:48 PM, Mario Blažević wrote:
On 12-09-26 08:07 PM, wren ng thornton wrote:
On 9/25/12 1:57 PM, Sjoerd Visscher wrote:
Maybe we could make a literal [a,b,c] turn into unpack [a,b,c]# where [a,b,c]# is a statically-allocated vector?
I'm kinda surprised this isn't already being done. Just doing this seems like it'd be a good undertaking, regardless of whether we get overloaded list literals. Just storing the literal as a C-like array and inflating it to a list/array/vector at runtime seems like it should be a big win for code that uses a lot of literals.
Why?
I'm surprised that this is an issue at all. If list literals you are talking about are constant, wouldn't GHC apply constant folding and construct the list only the first time it's needed?
The problem is: if the list is stored naively in the .data segment (as apparently it is), then we have to store all the pointer structure as well as the data. This hugely bloats the disk footprint for programs. That is, all the reasons why String=[Char] is bad at runtime are also reasons why this representation is bad at objectcode time. For most lists, the pointer structure is a considerable portion of the total memory cost. During runtime this overhead is (or at least may be) unavoidable due to the dynamic nature of program execution; but there's no reason to have this overhead in the compiled format of the program since it's trivial to generate it from a compact representation (e.g., storing lists as C-style arrays + lengths). The only conceivable benefit of storing lists on disk using their heap representation is to allow treating the .data segment as if it were part of the heap, i.e., to have zero-cost inflation and to allow GC to ignore that "part of the heap". However, for lists, I can't imagine this actually being beneficial in practice. This sort of thing is more beneficial for large structures of static data (e.g., sets, maps,...). But then for large static data, we still usually want a non-heap representation (e.g., cache-oblivious datastructures), since we're liable to only look at the data rather than to change it. It's only when we have lots of static "mutable" data that it makes sense to take heap snapshots. -- Live well, ~wren

On Mon, Sep 24, 2012 at 5:53 AM, George Giorgidze
Our second approach to OverloadedLists is to avoid the construction of lists altogether. By typechecking and desugaring lists like
[] ; [x,y,z] ; ['a' .. 'z'] ;
as
mempty ; singleton x `mappend` singleton y `mappend` singleton z ; genericEnumFromTo 'a' 'z' ;
This is very interesting. As Michael mentions later, we already have mechanisms in place to work around the creation of constant strings for the Text and ByteString types, and they rely on a combination of GHC rewrite rules and knowledge about the internal representation of constant strings used by GHC. We are fortunate that GHC uses a very efficient representation to store constant strings, so doing the translation is efficient. Constant lists are another story entirely (for good reason); the generated object files are bloated and poorly laid out, when for simple types (integers and whatnot), I'd really like to see a packed array in the .data section. I would be interested to see if an approach that avoids list construction can also aim to achieve a more efficient object file layout, with the implied goal being to make fast translation to the runtime representation easily achievable.

On 9/24/12 8:53 AM, George Giorgidze wrote:
We will release GHC patches for both approaches, meanwhile the feedback from the community on the approaches that we took would be very much appreciated. Which one those would you prefer? or would you suggest a different one.
The first one is much cleaner, and more closely mirrors the other overloaded literals. It seems that in most cases the intermediate list should be eliminated via build/foldr fusion. Did you do any testing to figure out why that fusion wasn't happening? (I.e., *why* is the generic approach faster?) The only other thing I'll mention is that for overloadable strings, part of the reason why they're so fast is that string literals are stored a la C, and so the conversion to ByteString and Text requires minimal work. I wonder if you might be able to leverage a similar technique for representing list literals as vectors, which are then inflated to lists/vectors/sets/whatever at runtime. -- Live well, ~wren
participants (7)
-
Bryan O'Sullivan
-
George Giorgidze
-
Mario Blažević
-
Michael Snoyman
-
Simon Peyton-Jones
-
Sjoerd Visscher
-
wren ng thornton