Generic tries (long)

Hi, I'm writing a library for generic tries for the Summer of Code. The main point of this post is to get some feedback on the api but I'll briefly explain the idea. The point of a trie is to exploit the recursive nature of ADTs to save on expensive key comparisons and reduce space consumption. Hinze' original formulation is very elegant but results in very deep structures and is fairly inefficient. The normal trie optimisations (concatenating singleton maps, mainly) cant be applied to the generic version. I intend instead to encode ADTs as lists. This encoding can range from a simple walk of the ADT to creating a compressed, bit packing representation. The resulting tradeoff between encoding time and space usage should make the design fairly flexible. This will look something like: class Serial k where -- | Flattened form of key consists of a list of Nodes -- Node will be Int or UArray Int for compressed implementations type Node k -- | Flatten to a list of Nodes serialise :: k -> [Node k] -- | Reconstruct from a list of Nodes unserialise :: [Node k] -> k The api below is mostly cribbed from Adrian Hey's initial design. Guarantees about ordering will probably vary between maps. All ascending have descending version too. Strict versions of functions will be written where appropriate, I've omitted them here for brevity. Key reconstruction is likely to be expensive so it may make more sense to seperate foldrKeys and friends into a seperate class. Adrian has written instance of GMap for lists, UInts and Ord types so I can declare instance (Serial k) => GMap (ListGMap (Node k)) k where ... etc I should have a spot on code.haskell.org soon, at which point I'll put up a Haddock page with the most up to date version of the api. class GMap map where type k -- | The empty map. empty :: map a -- | Create a map with a single association. singleton :: k -> a -> map a -- | Create a map from a list of associations which /must/ be in ascending order of keys -- (with /no/ duplicate keys). If in doubt use one of the safer (but slower) 'fromAssocs' functions. fromAssocsAscending :: [(k,a)] -> map a -- | Return 'True' if the map contains no associations. isEmpty :: map a -> Bool -- | Return 'True' if the map contains exactly one association. isSingleton :: map a -> Bool -- | Return the value associated with the supplied key (if any). lookup :: k -> map a -> Maybe a -- | Insert a new association in the map if there is currently no value associated with the key. -- If there is a value associated with the key then replace it with the result of -- applying the supplied function to that value. insert :: (a -> a) -> k -> a -> map a -> map a -- | Delete the association for the supplied key (if any). delete :: k -> map a -> map a -- | This is a combined insert\/modify\/delete operation. The argument to the supplied function -- is ('Just' a) if there is a value (a) associated with the supplied key, otherwise 'Nothing'. -- If the return value is ('Just' a'), a' becomes the new value associated with the supplied key. -- If the return value is 'Nothing', the association for the supplied key (if any) is deleted. alter :: (Maybe a -> Maybe a) -> k -> map a -> map a -- | Evaluate the union of two maps. If the maps contain common keys then combine the -- values associated with those keys using the supplied function. The value arguments -- to this function are supplied in the same order as the map arguments. union :: (a -> a -> a) -> map a -> map a -> map a -- | Evaluate the intersection of two maps, combining common associations using the supplied function. intersection :: (a -> b -> c) -> map a -> map b -> map c -- | Evaluate the difference between two maps. For any key occuring in the second map, -- the corresponding association (if any) is deleted from the first map. -- The associated values in the second map are irrelevant. difference :: map a -> map b -> map a -- | Returns true if the keys in the first map are a subset of the keys in the second map. -- (This includes the case where the key sets are identical). Note that this function does -- not examine the associated values (which are irrelevant). See 'isSubmapOf' if you -- do want associated values examined. isSubsetOf :: map a -> map b -> Bool -- | Returns true if the keys in the first map are a subset of the keys in the second map -- and the corresponding function always returns true when applied to the values associated -- with matching keys. isSubmapOf :: (a -> b -> Bool) -> map a -> map b -> Bool -- | Apply the supplied function to every associated value in the map. map :: (a -> b) -> map a -> map b -- | Apply the supplied function to every association in the map, and use the result -- as the new associated value for the corresponding key. mapWithKey :: (k -> a -> b) -> map a -> map b -- | Delete associations for which the supplied predicate returns 'False' when applied to -- the associated value. filter :: (a -> Bool) -> map a -> map a -- | Fold right over the list of elements in ascending order of keys. -- See 'foldrElemsAscending'' for a strict version of this function. foldrElemsAscending :: (a -> b -> b) -> map a -> b -> b -- | Fold right over the list of keys in ascending order. -- See 'foldrKeysAscending'' for a strict version of this function. foldrKeysAscending :: (k -> b -> b) -> map a -> b -> b -- | Fold right over the list of associations in ascending order of keys. -- See 'foldrAssocsAscending'' for a strict version of this function. foldrAssocsAscending :: (k -> a -> b -> b) -> map a -> b -> b -- | Fold over elements in un-specified order using /unboxed/ Int accumulator (with GHC). -- Defaults to boxed Int for other Haskells. Typically used for counting functions. -- Implementations are free to traverse the map in any order. -- The folded function is always applied strictly. foldElemsUINT :: (a -> UINT -> UINT) -> map a -> UINT -> UINT In addition there a few functions which are useful for groups of maps or nested maps. -- | Add the number of associations in a map to the supplied /unboxed/ Int (with GHC). -- Defaults to boxed Int for other Haskells. addSize :: map a -> UINT -> UINT -- | Find the value associated with the supplied key (if any) and return the result -- of applying the supplied continuation function to that value. Useful for nested lookup. lookupCont :: (a -> Maybe b) -> k -> map a -> Maybe b -- | Reject empty maps (return Nothing). nonEmpty :: map a -> Maybe (map a) The following functions are useful internally as most of the maps are defined in terms of simpler maps. Also see this thread on the need for unionMaybe. -- | Similar to 'insert', but the association is deleted if the supplied function returns 'Nothing'. -- (The supplied function is always applied strictly.) insertMaybe :: (a -> Maybe a) -> k -> a -> map a -> map a -- | Find the value associated with the supplied key (if any) and apply the supplied function -- to that value. Delete the association if the result is 'Nothing'. Replace the old value with -- the new value if the result is ('Just' something). -- (The supplied function is always applied strictly.) deleteMaybe :: (a -> Maybe a) -> k -> map a -> map a -- | Evaluate the union of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) unionMaybe :: (a -> a -> Maybe a) -> map a -> map a -> map a -- | Evaluate the intersection of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) intersectionMaybe :: (a -> b -> Maybe c) -> map a -> map b -> map c -- | Difference with a combining function. If the combining function returns -- @Just a@ then the corresponding association is not deleted from the result map -- (it is retained with @a@ as the associated value). differenceMaybe :: (a -> b -> Maybe a) -> map a -> map b -> map a -- | Apply the supplied function to every associated value in the map. -- If the result is 'Nothing' then the delete the corresponding association. -- (The supplied function is always applied strictly.) mapMaybe :: (a -> Maybe b) -> map a -> map b Thanks Jamie

Hello All, Folks, could you please check this out and comment on the proposed class API(s), in particular if there's anything that's really bugged you about Data.Map/Set etc now is your chance to get a (hopefully) decent performing lib with your personal wishlist items included. Jamie, I'm sure it would help a lot if you could produce (and update if necessary) documented API proposal using Haddock. The email is rather difficult to follow IMHO (at least with my email client)
Guarantees about ordering will probably vary between maps.
I take it you mean different concrete implementations. For classes I think it's important to be clear exactly what is being promised wrt ordering, and that promise must be kept by *all* instances. A "class law" that applies to some instances and not others is no use at all really (we'll end up with kind of mess and confusion we seem to have for the Ord/Eq classes at present :-) As having an explicit Ord constraint is objectionable to some folk and there might indeed be efficient implementations that don't offer any ordering guarantees, it may be good to break that API into an unordered Map class and have a separate subclass for methods that are supposed to deliver on ordering promises. Regarding ordering promises, what do folk think they should be? I think ideally we would want the same Ordering as the corresponding Ord instance (if it exists), but this is hard to guarantee as typically we won't have any control over the Ord instance definition (and won't be using the compare method). Would it be OK to simply to promise that the Ordering is the same as that that would be wholly derived? In which case we wouldn't want the Ord constraint, but we probably would want our own private compare method (which generally won't agree with Ords compare). If we want ordering that's guaranteed to be consistent with Ord (and have a meaningful Ord constraint), I can think of two ways to do this: 1- Have users write the matching instance by hand (a huge amount of work if we're talking about a generalised trie). 2- Fall back on balanced tree implementation and actually use Ords compare method (slow). Regards -- Adrian Hey

On Mon, Jun 16, 2008 at 9:28 PM, Jamie Brandon
Hi, I'm writing a library for generic tries for the Summer of Code. The main point of this post is to get some feedback on the api but I'll briefly explain the idea.
The point of a trie is to exploit the recursive nature of ADTs to save on expensive key comparisons and reduce space consumption. Hinze' original formulation is very elegant but results in very deep structures and is fairly inefficient. The normal trie optimisations (concatenating singleton maps, mainly) cant be applied to the generic version.
I intend instead to encode ADTs as lists. This encoding can range from a simple walk of the ADT to creating a compressed, bit packing representation. The resulting tradeoff between encoding time and space usage should make the design fairly flexible. This will look something like:
class Serial k where
-- | Flattened form of key consists of a list of Nodes -- Node will be Int or UArray Int for compressed implementations type Node k
-- | Flatten to a list of Nodes serialise :: k -> [Node k]
-- | Reconstruct from a list of Nodes unserialise :: [Node k] -> k
The api below is mostly cribbed from Adrian Hey's initial design. Guarantees about ordering will probably vary between maps. All ascending have descending version too. Strict versions of functions will be written where appropriate, I've omitted them here for brevity. Key reconstruction is likely to be expensive so it may make more sense to seperate foldrKeys and friends into a seperate class.
Adrian has written instance of GMap for lists, UInts and Ord types so I can declare instance (Serial k) => GMap (ListGMap (Node k)) k where ... etc
I should have a spot on code.haskell.org soon, at which point I'll put up a Haddock page with the most up to date version of the api.
class GMap map where
type k
-- | The empty map. empty :: map a
-- | Create a map with a single association. singleton :: k -> a -> map a
-- | Create a map from a list of associations which /must/ be in ascending order of keys -- (with /no/ duplicate keys). If in doubt use one of the safer (but slower) 'fromAssocs' functions. fromAssocsAscending :: [(k,a)] -> map a
-- | Return 'True' if the map contains no associations. isEmpty :: map a -> Bool
-- | Return 'True' if the map contains exactly one association. isSingleton :: map a -> Bool
-- | Return the value associated with the supplied key (if any). lookup :: k -> map a -> Maybe a
-- | Insert a new association in the map if there is currently no value associated with the key. -- If there is a value associated with the key then replace it with the result of -- applying the supplied function to that value. insert :: (a -> a) -> k -> a -> map a -> map a
-- | Delete the association for the supplied key (if any). delete :: k -> map a -> map a
-- | This is a combined insert\/modify\/delete operation. The argument to the supplied function -- is ('Just' a) if there is a value (a) associated with the supplied key, otherwise 'Nothing'. -- If the return value is ('Just' a'), a' becomes the new value associated with the supplied key. -- If the return value is 'Nothing', the association for the supplied key (if any) is deleted. alter :: (Maybe a -> Maybe a) -> k -> map a -> map a
-- | Evaluate the union of two maps. If the maps contain common keys then combine the -- values associated with those keys using the supplied function. The value arguments -- to this function are supplied in the same order as the map arguments. union :: (a -> a -> a) -> map a -> map a -> map a
-- | Evaluate the intersection of two maps, combining common associations using the supplied function. intersection :: (a -> b -> c) -> map a -> map b -> map c
-- | Evaluate the difference between two maps. For any key occuring in the second map, -- the corresponding association (if any) is deleted from the first map. -- The associated values in the second map are irrelevant. difference :: map a -> map b -> map a
-- | Returns true if the keys in the first map are a subset of the keys in the second map. -- (This includes the case where the key sets are identical). Note that this function does -- not examine the associated values (which are irrelevant). See 'isSubmapOf' if you -- do want associated values examined. isSubsetOf :: map a -> map b -> Bool
-- | Returns true if the keys in the first map are a subset of the keys in the second map -- and the corresponding function always returns true when applied to the values associated -- with matching keys. isSubmapOf :: (a -> b -> Bool) -> map a -> map b -> Bool
-- | Apply the supplied function to every associated value in the map. map :: (a -> b) -> map a -> map b
-- | Apply the supplied function to every association in the map, and use the result -- as the new associated value for the corresponding key. mapWithKey :: (k -> a -> b) -> map a -> map b
-- | Delete associations for which the supplied predicate returns 'False' when applied to -- the associated value. filter :: (a -> Bool) -> map a -> map a
-- | Fold right over the list of elements in ascending order of keys. -- See 'foldrElemsAscending'' for a strict version of this function. foldrElemsAscending :: (a -> b -> b) -> map a -> b -> b
-- | Fold right over the list of keys in ascending order. -- See 'foldrKeysAscending'' for a strict version of this function. foldrKeysAscending :: (k -> b -> b) -> map a -> b -> b
-- | Fold right over the list of associations in ascending order of keys. -- See 'foldrAssocsAscending'' for a strict version of this function. foldrAssocsAscending :: (k -> a -> b -> b) -> map a -> b -> b
-- | Fold over elements in un-specified order using /unboxed/ Int accumulator (with GHC). -- Defaults to boxed Int for other Haskells. Typically used for counting functions. -- Implementations are free to traverse the map in any order. -- The folded function is always applied strictly. foldElemsUINT :: (a -> UINT -> UINT) -> map a -> UINT -> UINT
In addition there a few functions which are useful for groups of maps or nested maps.
-- | Add the number of associations in a map to the supplied /unboxed/ Int (with GHC). -- Defaults to boxed Int for other Haskells. addSize :: map a -> UINT -> UINT
-- | Find the value associated with the supplied key (if any) and return the result -- of applying the supplied continuation function to that value. Useful for nested lookup. lookupCont :: (a -> Maybe b) -> k -> map a -> Maybe b
-- | Reject empty maps (return Nothing). nonEmpty :: map a -> Maybe (map a)
The following functions are useful internally as most of the maps are defined in terms of simpler maps. Also see this thread on the need for unionMaybe.
-- | Similar to 'insert', but the association is deleted if the supplied function returns 'Nothing'. -- (The supplied function is always applied strictly.) insertMaybe :: (a -> Maybe a) -> k -> a -> map a -> map a
-- | Find the value associated with the supplied key (if any) and apply the supplied function -- to that value. Delete the association if the result is 'Nothing'. Replace the old value with -- the new value if the result is ('Just' something). -- (The supplied function is always applied strictly.) deleteMaybe :: (a -> Maybe a) -> k -> map a -> map a
-- | Evaluate the union of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) unionMaybe :: (a -> a -> Maybe a) -> map a -> map a -> map a
-- | Evaluate the intersection of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) intersectionMaybe :: (a -> b -> Maybe c) -> map a -> map b -> map c
-- | Difference with a combining function. If the combining function returns -- @Just a@ then the corresponding association is not deleted from the result map -- (it is retained with @a@ as the associated value). differenceMaybe :: (a -> b -> Maybe a) -> map a -> map b -> map a
-- | Apply the supplied function to every associated value in the map. -- If the result is 'Nothing' then the delete the corresponding association. -- (The supplied function is always applied strictly.) mapMaybe :: (a -> Maybe b) -> map a -> map b
Thanks
Jamie _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
Just a thought: why not have: import Data.Traversable class Traversable map => GMap map where ... (remove 'map') ... Because all maps ought to be traversable. See: http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Traversable.html Bas

Ive put up the haddock api at http://code.haskell.org/gmap/api/GMap.html The interface is now split into GMap which make no guarantees on ordering and OrderedGMap which provides a compareKey function as well as ordered folds etc.
Because all maps ought to be traversable.
Thats a good point. Traversable requires Foldable. Do people expect Foldable to imply some ordering? I can put it into OrderedGMap at least. Jamie

Jamie Brandon wrote:
Ive put up the haddock api at http://code.haskell.org/gmap/api/GMap.html
The interface is now split into GMap which make no guarantees on ordering and OrderedGMap which provides a compareKey function as well as ordered folds etc.
Because all maps ought to be traversable.
Thats a good point. Traversable requires Foldable. Do people expect Foldable to imply some ordering? I can put it into OrderedGMap at least.
Foldable does, trivially, imply ordering: "fold" can distinguish the order. So you might have to be careful about breaking referential transparency (at least if you want "==" GMaps to always be extensionally equivalent). But obviously a container without some concept of ordering will probably not be in any particularly *meaningful* order! It should still be Foldable if there's no reason not to (which is useful at least for commutative+associative ops like "+"). All IMHO. -Isaac

Jamie Brandon wrote:
Ive put up the haddock api at http://code.haskell.org/gmap/api/GMap.html
I have a few comments on the proposed API. 1) Some terminology seems queer to me. In particular, I don't like the terms "assoc" and "association" in the documentation. The Data.Map documentation uses "key/value pair" or just "key" and "value"/"element" instead, which I think is much better. For instance: "Insert a new association in the map" :( "Insert a new key and value in the map." :) Also, I like fromList better than fromAssocs . 2) Don't use Int# , looks like a premature optimization to me. Furthermore, I'd change the queer addSize to simply size :: map a -> Int 3) insert is strange. Why not use the convention in Data.Map, name it insertWith and have insert :: k -> a -> map a -> map a instead? Similar for union . 4) Most functions, in particular the ..Maybe variants have fixed default definitions. I wouldn't include them in the class. How about a minimal class interface along the lines of class GMap map k | map -> k where empty :: map a null :: map a -> Bool lookup :: k -> map a -> Maybe a alter :: (Maybe a -> Maybe a) -> k -> map a -> map a merge :: (k -> Maybe a -> Maybe b -> Maybe c) -> map a -> map b -> map c fromList :: [(k,a)] -> map a toList :: map a -> [(k,a)] instance Functor map where ... and implementing the various variants as normal polymorphic functions? Like insert k a = alter (const $ Just a) k singleton k a = insert k a empty IMHO, having a few "flag" functions like difference :: k -> Maybe a -> Maybe b -> Maybe b union :: k -> Maybe a -> Maybe a -> Maybe a intersect :: k -> Maybe a -> Maybe a -> Maybe a that can be plugged into merge is much nicer than having to remember/lookup all the ..Maybe and whatnot'' variants. 5) I think the following functions are queer lookupCont f k m = lookup k m >>= f pair k1 k2 = if k1 == k2 then Nothing else Just $ \a1 a2 -> insert k1 a1 (insert k2 a2 empty) What use is pair ? I'd definitively put them outside the class if not remove them entirely. 6) status is an interesting idea with a meaningless name. Don't you have a name that is more to the point, that expresses the nub (pun intended) of what status does? 7) I'm overwhelmed by the many foldr...Asc/Descending functions. They all can be expressed in terms of toListAsc and toListDesc and I don't think that "marginally faster" is a reason not to do so. I'd throw away the from...L functions, too. Regards, apfelmus

Hello apfelmus, Thanks for taking the time to look at this. As what Jamie has posted is largely based on my own initial efforts I can offer some insight about what's going on here and why the class API looks like it does. The first thing to note is that what Jamie has posted is the proposed class methods *only*. It's not the complete user level map API. There's also a whole lot of more convenient and sensible looking functions that are just regular overloaded functions. Unfortunately these are not visible in the API posted. Among them are things like this.. size :: GT map k => map a -> Int size mp = ASINT(addSize mp L(0)) (ASINT and L are a cpp macros) elemsAscending :: GT map k => map a -> [a] elemsAscending mp = foldrElemsAscending (:) mp [] assocsAscending :: GT map k => map a -> [(k,a)] assocsAscending mp = foldrAssocsAscending (\k a assocs -> (k,a):assocs) mp [] keysAscending :: GT map k => map a -> [k] keysAscending mp = foldrKeysAscending (:) mp [] The second thing to note is that the class API has been designed with the *implementation* of generalised tries in mind. It's not necessary that any instance is actually implemented as any kind of trie, but it is necessary that the resulting API contains what's needed to enable its use in other instances that are based on generalised tries. So the actual class methods chosen are designed to be convenience for generalised trie implementation, not typical map users. The types are chosen to reflect how they will likely be used in other generalised trie implementations and the functionality provided is what it seems is actually needed to do this efficiently. What the class API contains has kind of evolved empirically with stuff being added as and when it was discovered it was needed (by implementing a few common instancance by hand). Here's my thoughtS about your specific observations.. apfelmus wrote:
1) Some terminology seems queer to me. In particular, I don't like the terms "assoc" and "association" in the documentation. The Data.Map documentation uses "key/value pair" or just "key" and "value"/"element" instead, which I think is much better. For instance:
"Insert a new association in the map" :( "Insert a new key and value in the map." :)
Also, I like fromList better than fromAssocs .
Well this is one of those bike shed arguments :-) I'm easy about it so if Jamie agrees with you that's fine. Use of the term "association" seems quite common. I've seen numereous uses of the term "association list", never seen anyone talk about "key/value pair list".
2) Don't use Int# , looks like a premature optimization to me. Furthermore, I'd change the queer addSize to simply
size :: map a -> Int
This does exist (see above), but it's not a class method. One could argue that is was unnecessary to make unboxing explicit. It's something I got into the habit of doing because it's way easier to do that than is inspecting ghc's output to make it's done it on its own (and figuring out what to do about it if it hasn't). Also, because of the nested nature generalised tries addSize is more convenient for implementors than size IMO.
3) insert is strange. Why not use the convention in Data.Map, name it insertWith and have
insert :: k -> a -> map a -> map a instead?
Yes, this this should probably be added to the user API. As a choice of primitive class method, the current form seems more appropriate (though perhaps not it's name). All the common user level variants can easily be implemented with this primitive.
Similar for union .
Actually I don't like Data.Maps union much. It would be deprecated if I ruled the world. I think users should be always be made to specify how overlapping values are to be combined (or discarded or whatever). So the name union is now free we don't need to qualify the function that is explicit about this using a "With" suffix.
4) Most functions, in particular the ..Maybe variants have fixed default definitions. I wouldn't include them in the class. How about a minimal class interface along the lines of
class GMap map k | map -> k where empty :: map a null :: map a -> Bool
lookup :: k -> map a -> Maybe a alter :: (Maybe a -> Maybe a) -> k -> map a -> map a merge :: (k -> Maybe a -> Maybe b -> Maybe c) -> map a -> map b -> map c
fromList :: [(k,a)] -> map a toList :: map a -> [(k,a)]
instance Functor map where ...
and implementing the various variants as normal polymorphic functions? Like
insert k a = alter (const $ Just a) k singleton k a = insert k a empty
IMHO, having a few "flag" functions like
difference :: k -> Maybe a -> Maybe b -> Maybe b union :: k -> Maybe a -> Maybe a -> Maybe a intersect :: k -> Maybe a -> Maybe a -> Maybe a
that can be plugged into merge is much nicer than having to remember/lookup all the ..Maybe and whatnot'' variants.
I've never been very keen on alter myself (on efficiency grounds) and was wondering whether of not to include it. If the "altering" results in an unchanged map it would be nice to just return the unchanged map (rather than duplicate all nodes on the search path). There are other possible alternatives to alter that are more efficient in this respect. The merge function looks like an interesting idea, but it's not clear to me that it can always (or ever even :-) be implemented as efficiently as the more specialised versions. Maybe as and experiment we could implement it and if it turns out that union,intersection etc can be implemeted using it with liitle or no extra cost then we could put them in the convinience API instead (not as class methods).
5) I think the following functions are queer
lookupCont f k m = lookup k m >>= f
Or perhaps.. lookup = lookupCont Just Yes, there's probably unwanted duplication in the class methods there. My vote would be to keep lookupCont and a class method and have lookup as regular overloaded function.
pair k1 k2 = if k1 == k2 then Nothing else Just $ \a1 a2 -> insert k1 a1 (insert k2 a2 empty)
What use is pair ? I'd definitively put them outside the class if not remove them entirely.
Again, pair is one of those things that a typical Map user wouldn't use, but is definitely needed to implement an efficient trie for Lists and probably product types in general. The reason for it's existance is that singleton maps need special treatment (you want to avoid making long chains of them). Unfortunately the above definition is inefficient. You've already done most of the work requred to evaluate pair in the first equality test. This would all have to be repeated in the second insertion. Have a look at the ListGT module to see how pair is defined and used for the gory details.
6) status is an interesting idea with a meaningless name. Don't you have a name that is more to the point, that expresses the nub (pun intended) of what status does?
If you're suggesting it should be called nub, that seems confusing considering that name is already used in Data.List to mean something quite different. Perhaps you can suggest some other name?
7) I'm overwhelmed by the many foldr...Asc/Descending functions. They all can be expressed in terms of toListAsc and toListDesc and I don't think that "marginally faster" is a reason not to do so.
..or vice-versa, but without the need for list deforestation to get the same efficiency. Also I think we need to distinguish between the variants that require key reconstruction from those that don't (keys generally aren't stored in a trie).
I'd throw away the from...L functions, too.
The L versions can improve performance in some map implementations where length is needed and is already known as by product of other stuff that's going on. In such cases it seems to make sense to pass it as an argument rather than incur the cost of evaluating it again from scratch. Anway, I don't think we should get to worried about the precise details of class methods right now. I think the main concern should be getting the class hierarchy right, I'm not sure that it at present. e.g. There may be specialised representations that support lookup very efficiently but not much else. Should we have a separate class for them? Should we have separate classes for implementations that store keys (hence no key reconstruction cost) and those that must reconstruct keys? Then there's the ordering can of worms too. This is the sort of thing that really needs to be got right from the start IMHO. Regards -- Adrian Hey

On Fri, Jun 20, 2008 at 11:51:40AM +0100, Adrian Hey wrote:
2) Don't use Int# , looks like a premature optimization to me. Furthermore, I'd change the queer addSize to simply
size :: map a -> Int
This does exist (see above), but it's not a class method. One could argue that is was unnecessary to make unboxing explicit. It's something I got into the habit of doing because it's way easier to do that than is inspecting ghc's output to make it's done it on its own (and figuring out what to do about it if it hasn't). Also, because of the nested nature generalised tries addSize is more convenient for implementors than size IMO.
I agree that using Int# is a bad idea. It makes the package unportable, and it forces all the packages that implement the class to write code using Int#. Thanks Ian

Ian Lynagh wrote:
I agree that using Int# is a bad idea. It makes the package unportable, and it forces all the packages that implement the class to write code using Int#.
As they should IMO :-) I wish this wasn't necessary, but my experience in the past tells me it is. ghc's optimisations seem too fragile to rely on for this. BTW, it's actually controlled by cpp macros, so the unboxed versions are ghc only. Unfortunately you can't tell this from the Haddock because Haddock doesn't understand cpp. But in this case it scarcely matters as it will be making use of type families which are a only going to supported by ghc anyway for the forseable future AFAIK. Regards -- Adrina Hey

Adrian Hey wrote:
The first thing to note is that what Jamie has posted is the proposed class methods *only*. It's not the complete user level map API.
The second thing to note is that the class API has been designed with the *implementation* of generalised tries in mind.
Ah ok, I thought that would be the user API already. So, the class API is for "users" who want to plug in their own generalized trie implementations? In any case, I'm all for keeping it simple and minimal, possible minor efficiency penalties notwithstanding :)
I don't like the terms "assoc" and "association"
Well this is one of those bike shed arguments :-) I'm easy about it so if Jamie agrees with you that's fine. Use of the term "association" seems quite common. I've seen numereous uses of the term "association list", never seen anyone talk about "key/value pair list".
The term "association list" seems to be atomic, i.e. it's not a list of associations. The dictionary of A&DS ( http://www.nist.gov/dads/ ) always says "key" and "value", though it doesn't mention pairs of these directly. The documentation for Data.Map uses "key/value pair" explicitly.
alter :: (Maybe a -> Maybe a) -> k -> map a -> map a
I've never been very keen on alter myself (on efficiency grounds) and was wondering whether of not to include it. If the "altering" results in an unchanged map it would be nice to just return the unchanged map (rather than duplicate all nodes on the search path). There are other possible alternatives to alter that are more efficient in this respect.
You mean the case when f Nothing = Nothing in alter f .. ? Hm, maybe some zipper-like extension of lookup can do the trick focus :: k -> map a -> (Maybe a, Maybe a -> map a) lookup k = fst . focus k delete k m = case focus k m of (Nothing, _) -> m (_ , g) -> g Nothing alter f k m = case focus k m of (Nothing, g) -> case f Nothing of Nothing -> m x -> g x (x , g) -> g x
6) status is an interesting idea with a meaningless name. Don't you have a name that is more to the point, that expresses the nub (pun intended) of what status does?
If you're suggesting it should be called nub, that seems confusing considering that name is already used in Data.List to mean something quite different. Perhaps you can suggest some other name?
No, it shouldn't be nub of course, just a similarly succinct yet meaningful name. Hm, maybe monkeyView in accordance with the way the monkeys count: one, two, many. :) Regards, apfelmus

Hello apfelmus, apfelmus wrote:
I've never been very keen on alter myself (on efficiency grounds) and was wondering whether of not to include it. If the "altering" results in an unchanged map it would be nice to just return the unchanged map (rather than duplicate all nodes on the search path). There are other possible alternatives to alter that are more efficient in this respect.
You mean the case when
f Nothing = Nothing
or, for some a .. f (Just a) = Just a ..but of course there's no way you can tell from inside the alter function that the two a's are the same without an expensive equality test, and even then that's not enough because you need to pass the "no change" information back up the call chain (as a Nothing probably) which means that if there is a change even more heap will be burned by wrapping each intermediate result in a Just.
in alter f .. ? Hm, maybe some zipper-like extension of lookup can do the trick
focus :: k -> map a -> (Maybe a, Maybe a -> map a)
lookup k = fst . focus k delete k m = case focus k m of (Nothing, _) -> m (_ , g) -> g Nothing alter f k m = case focus k m of (Nothing, g) -> case f Nothing of Nothing -> m x -> g x (x , g) -> g x
I think it depends if this can be implemented without burning significant extra heap in either focus or the resulting g function. Generally zippers do require quite a bit of heap (proportional to trie/tree depth). If you consider what people are actually trying to achieve with alter I actually think it's rather easier to use as a 2 step process and forget about the (Maybe a -> Maybe a) function. After having done a lookup and examined the associated value (if any), you could do.. If search failed: 1f - Do nothing (this is a v. cheap operation :-) 2f - Insert a new association If search succeeded: 1s - Do nothing 2s - Modify the associated value 3s - Delete the association Despite it's generality, alter still fails to properly capture all the options, notably 1f and 1s. I guess it is conceivable that some implementations might be able to deal with 1f efficiently. For the Data.Map clone I wrote something like this .. -- An "open" map (this is abstract) data OMap k a = OMap k (Maybe a) (Map k a) Int# -- This is just a lookup that encodes the path taken as an unboxed Int open :: Ord k => k -> Map k a -> OMap k a -- Get the current associated value (if any) read :: OMap k a -> Maybe a -- Change the current associated value and close the new map -- This is v.fast. No comparisons, and no balance checking or -- rebalancing either if this is a substitution rather than an -- insertion. write :: a -> OMap k a -> Map k a -- Delete the current association (if any) and close the new map -- This is nop if there is no current association delete :: OMap k a -> Map k a -- Not really needed if original map is still in scope close :: OMap k a -> Map k a open burns no heap at all other than to construct the OMap record (possibly not even that) and possibly the (Just a). If it turns out there's no need to write or delete then don't (end of story). Even if a write is needed the whole combined process only takes about 10% longer than a normal insert, and that's with cheap comparisons (Ints). Admitedly for a trie the path would probably be something a bit more complex than an unboxed Int, but we could use a similar API. Regards -- Adrian Hey

Adrian Hey wrote:
For the Data.Map clone I wrote something like this ..
-- An "open" map (this is abstract) data OMap k a = OMap k (Maybe a) (Map k a) Int#
-- This is just a lookup that encodes the path taken as an unboxed Int open :: Ord k => k -> Map k a -> OMap k a
-- Get the current associated value (if any) read :: OMap k a -> Maybe a
-- Change the current associated value and close the new map -- This is v.fast. No comparisons, and no balance checking or -- rebalancing either if this is a substitution rather than an -- insertion. write :: a -> OMap k a -> Map k a
-- Delete the current association (if any) and close the new map -- This is nop if there is no current association delete :: OMap k a -> Map k a
-- Not really needed if original map is still in scope close :: OMap k a -> Map k a
I think it depends if this can be implemented without burning significant extra heap in either focus or the resulting g function. Generally zippers do require quite a bit of heap (proportional to trie/tree depth).
The OMap type is like a zipper, the Int# encodes the path. I don't know whether the Int# (which should be a !Int with an UNPACK pragma) really gains anything compared to a list, only benchmarks can tell. Fretting about it seems like an irrelevant micro-optimization to me. In any case, focus can easily be implemented in terms of OMap : focus :: k -> map a -> (Maybe a, Maybe a -> map a) focus k m = (read z, maybe (delete z) (`write` z)) where z = open k m So any efficient implementation for OMap gives an efficient implementation for focus . And vice-versa type OMap k a = (Maybe a, Maybe a -> Map k a) open = focus read = fst write x = ($ Just x ) . snd delete = ($ Nothing) . snd Regards, apfelmus

Hello apfelmus, apfelmus wrote:
The OMap type is like a zipper, the Int# encodes the path. I don't know whether the Int# (which should be a !Int with an UNPACK pragma) really gains anything compared to a list, only benchmarks can tell. Fretting about it seems like an irrelevant micro-optimization to me.
Actually MHO is that using proper language constructs (where available) is always preferable to pragma or compliler flag hackery :-) As for this kind of thing being irrelevant micro-optimization, this is not the case. If you consider what a typical Haskell prog spends it's time doing.. 1 - Traversing heap data structures 2 - Building new heap records 3 - Collecting garbage Excessive heap allocation rate has a big impact on time spent on both 2 and 3, and hence on overall performance. A fact that has been confirmed in just about every benchmark I've ever run. Here's the results of one I posted recently http://www.haskell.org/pipermail/haskell-cafe/2008-February/039882.html So if performance is a concern, MHO is that heap allocation should be avoided like the proverbial plague. In a world of immutable data structures it's always likely to be on the high side anyway, but anything that can be done to reduce it will boost performance. I do think a general purpose zipper would be a great thing to have, especially for ordered maps. But for the particular problem that sparked this discussion I suspect (though I don't know) that it might be expensive overkill. It needs thinking about. (There's at least one case I know of where there is a much cheaper solution.)
In any case, focus can easily be implemented in terms of OMap :
Yes, of course. The point I was trying to make about OMap was that if we're going to have full zipper available, one where you could actually "walk the map", inspecting, modifying, inserting and deleting as you go (call it ZMap say), then the OMap distinction with (corresponding impoverished API) is useful. Just making the point that the OMap or something like it is all that's needed to solve this particular problem (if it can be implemented cheaply). Regards -- Adrian Hey

On Sat, Jun 21, 2008 at 11:54 PM, Adrian Hey
Hello apfelmus,
apfelmus wrote:
I've never been very keen on alter myself (on efficiency grounds) and was wondering whether of not to include it. If the "altering" results in an unchanged map it would be nice to just return the unchanged map (rather than duplicate all nodes on the search path). There are other possible alternatives to alter that are more efficient in this respect.
You mean the case when
f Nothing = Nothing
or, for some a ..
f (Just a) = Just a
..but of course there's no way you can tell from inside the alter function that the two a's are the same without an expensive equality test, and even then that's not enough because you need to pass the "no change" information back up the call chain (as a Nothing probably) which means that if there is a change even more heap will be burned by wrapping each intermediate result in a Just.
Thinking as a C++ programmer for a bit, I'd want to implement this as a pointer comparison. If the argument to alter returns the same object (not just an equal one), then the implementation can cheaply avoid creating the new sub-tree. If it returns an equal one, you'd like to skip the copy, but it's not worth the equality test. Is there a reason I'm missing that it's a bad idea to use GHS.Prim.reallyUnsafePtrEquality# or a similar low-level function in the alter implementation? If alter were to cheat like this, we'd want to write its argument as: f n@Nothing = n f j@(Just a) = j instead of risking creating a new object. Trying this out, I figured out why it's called "reallyUnsafe": Prelude GHC.Exts GHC.Prim> let q = Just 3 Prelude GHC.Exts GHC.Prim> let f j@(Just a) = j; r = f q Prelude GHC.Exts GHC.Prim> I# (reallyUnsafePtrEquality# q r) 0 Prelude GHC.Exts GHC.Prim> r Just 3 Prelude GHC.Exts GHC.Prim> I# (reallyUnsafePtrEquality# q r) 1 Prelude GHC.Exts GHC.Prim> and even stranger: Prelude GHC.Exts GHC.Prim> let q = Just 3 Prelude GHC.Exts GHC.Prim> let f j@(Just a) = j; r = f q Prelude GHC.Exts GHC.Prim> I# (r `seq` reallyUnsafePtrEquality# q r) 0 Prelude GHC.Exts GHC.Prim> I# (r `seq` reallyUnsafePtrEquality# q r) 1 So it'd take some doing to get the implementation right, but getting the fully-general API without wasting allocations might be worth it. Jeffrey

Hello Jeffrey, Jeffrey Yasskin wrote:
Is there a reason I'm missing that it's a bad idea to use GHS.Prim.reallyUnsafePtrEquality# or a similar low-level function in the alter implementation?
I think, given the fuss about unboxed Ints, most people would recoil in horror at this :-) I guess what you could do would be to have the function argument of alter and friends return a type (Maybe (Maybe a)) where.. Nothing means delete it Just Nothing means leave it alone Just (Just a) means use a as the new associated value. ..then to pass the change/nochange information back up the call chain you could use an unboxed pair result containing an appropriate Bool in addition to the (possibly modified) subtree. But I can't help wondering if all the functions like this are really the wrong abstraction. They might be what you want sometimes, but often you'd want to combine them with a lookup (without searching the map twice in quick succession using the same key). Something like the proposed zipper based focus, OMap or whatever seems much more general purpose, flexible and easy to use. It's just a question of whether or not it can be implemented efficiently enough to make alter and friends obsolete (at least as class methods), or should it be an addition to an API that still retains alter (deleteMaybe etc..). Regards -- Adrian Hey

Hello apfelmus, apfelmus wrote:
in alter f .. ? Hm, maybe some zipper-like extension of lookup can do the trick
focus :: k -> map a -> (Maybe a, Maybe a -> map a)
lookup k = fst . focus k delete k m = case focus k m of (Nothing, _) -> m (_ , g) -> g Nothing alter f k m = case focus k m of (Nothing, g) -> case f Nothing of Nothing -> m x -> g x (x , g) -> g x
I think I prefer this API now, largely because it doesn't require the definition of a new type (per GMap instance presumbly). Only thing I have against it is that it creates another of those weird dual purpose functions that take (Maybe something) as an argument (don't like them much :-). So possibly instead we could have.. focus :: k -> map a -> (Maybe (a -- associated value (if any) ,map a -- k deleted map (unevaluated thunk) ) ,a -> map a -- Insertion/Substitution ) Again, I don't think this could replace alter and friends as a typical zipper implementation will burn as much heap as you're trying to save (twice as much if you go ahead and delete or substitute). But it would avoid an expensive second search in cases where you want the operation combined with a lookup. Instances that can do this cheaper than full zipper (unboxed Int path say) could define their alter,deleteMaybe etc using it if they wanted. (though AFAICS alter can be defined in terms of insertMaybe/deleteMaybe anyway with no efficiency loss). Regards -- Adrian Hey

Adrian Hey wrote:
I've never been very keen on alter myself (on efficiency grounds) and was wondering whether of not to include it. If the "altering" results in an unchanged map it would be nice to just return the unchanged map (rather than duplicate all nodes on the search path). There are other possible alternatives to alter that are more efficient in this respect.
There's something else that's always bugged me about alter. I'm sure I can't be the only one who thinks requiring me to define a function like this is a very weird thing to do... f :: Maybe a -> Maybe a f Nothing = maybea -- A constant, either Nothing or (Just somea) f (Just a) = f' a I've never felt comfortable with this. In any case I realised that alter can be easily implemented using other class primitives alter f k map = case f Nothing of Just somea -> insertMaybe f' k somea map Nothing -> deleteMaybe f' k map where f' a = f (Just a) but since I know what maybea and f' are, it seems to me it's a lot easier to just use f' directly in either insertMaybe or deleteMaybe as appropriate (depending on maybea). The type of the proposed merge now seems similarly strange to me.. merge :: (k -> Maybe a -> Maybe b -> Maybe c) -> map a -> map b -> map c This requres users to define a function argument that is presumably of form.. f k Nothing Nothing = undefined f k (Just a) Nothing = fa k a f k (Just a) (Just b) = fab k a b f k Nothing (Just b) = fb k b Why not just pass fa,fab and fb directly, which will be more convenient for both users and implementors I think.. merge :: (k -> a -> Maybe c) -> (k -> a -> b -> Maybe c) -> (k -> b -> Maybe c) -> map a -> map b -> map c Though I actually think if this is to be included perhaps it should be called mergeMaybeWithKey, and we also have.. mergeMaybe :: (a -> Maybe c) -> (a -> b -> Maybe c) -> ( b -> Maybe c) -> map a -> map b -> map c merge :: (a -> c) -> (a -> b -> c) -> ( b -> c) -> map a -> map b -> map c mergeWithKey :: (k -> a -> c) -> (k -> a -> b -> c) -> (k -> b -> c) -> map a -> map b -> map c Regards -- Adrian Hey

Adrian Hey wrote:
The type of the proposed merge now seems similarly strange to me..
merge :: (k -> Maybe a -> Maybe b -> Maybe c) -> map a -> map b -> map c
This requres users to define a function argument that is presumably of form
f k Nothing Nothing = undefined f k (Just a) Nothing = fa k a f k (Just a) (Just b) = fab k a b f k Nothing (Just b) = fb k b
Why not just pass fa,fab and fb directly, which will be more convenient for both users and implementors I think..
merge :: (k -> a -> Maybe c) -> (k -> a -> b -> Maybe c) -> (k -> b -> Maybe c) -> map a -> map b -> map c
While every such f must have this form, in the sense that \f k -> (\a -> f k (Just a) Nothing , \a b -> f k (Just a) (Just b), \b -> f k Nothing (Just b)) is an isomorphism, it doesn't mean that it's explicitly implemented that way. The intention was that the library exports ready-made functions union, intersect, difference :: k -> Maybe a -> Maybe a -> Maybe a and combinators like unionWith :: (k -> a -> b -> c) -> (k -> Maybe a -> Maybe b -> Maybe c) that can be plugged into merge , like merge intersect merge (unionWith $ curry snd) Thus, the user doesn't implement the argument to merge himself unless he requires custom behavior. Hence, using one argument instead of three is more convenient here. The particular form union, intersect, difference :: Maybe a -> Maybe a -> Maybe a has mnemonic value as well, since Maybe a is the finite map with one element, so the combinator intersect actually intersects two finite maps. You're probably right concerning the efficiency of merge . The problem is that merge may decide per element whether to intersect, union, difference or something, while the original intersect may only intersect elements and can hence throw whole subtrees away without looking into them. An signature for merge that does not allow per-element tests would be merge :: (Bool -> Bool -> Bool) -> (k -> a -> b -> c) -> map a -> map b -> map c Here, the boolean function determines membership while the second argument determines how to merge two values. There is the small problem that the boolean function f ought to fulfill f False False = False. This can be guaranteed by using a rank-2 type merge :: (forall a. Maybe a -> Maybe a -> Maybe a) -> ... Incidentally, this restores the fact that the first argument combines one-element finite maps. Regards, apfelmus

Hello Jamie, Jamie Brandon wrote:
Ive put up the haddock api at http://code.haskell.org/gmap/api/GMap.html
The ordering issue is still bothering the heck out of me. With the above OrderedGMap class (only) we lose the ability to use tries for sorting faster than the usual O(n . log n), which seems a real shame. Especially as it means folk will often end up having to do a O(n . log n) sort stage on the data anyway (without the benefit of "trie acceleration" presumably). So I think I would really like to see a subclass that *does* guarantee Ordering that is consistent with Ord. The question of how instances of that subclass are produced/derived or whatever is still open. Automated trie deriving could presumably only guarantee ordering consistent with a (wholly) derived Ord instance, not hand written Ord instances. I also find myself wondering if we really need to make the distinction between Map implementations that really can make *no* guarantee about consiststent ordering and maps that do guarantee consistent ordering (that is inconsistent with Ords ordering). I'm struggling to imagine any decent Map implementation that falls into the former category (I.E. is an instance of GMap *only*). I guess hash tables that used a linear list for each bucket are one possible example, hmm.. Also, a minor naming niggle with the folds in the GMap class. I think these should have a "fold" prefix only (not "foldr"), as there is no implied ordering guarantee. I'm also wondering if we should at least have an Eq constraint on keys for GMap. I guess we should. Regards -- Adrian Hey

Adrian Hey wrote:
So I think I would really like to see a subclass that *does* guarantee Ordering that is consistent with Ord.
Why not just use class (Ord k, GMap map k) => OrderedGMap map k where instead of class GMap map k => OrderedGMap map k where compareKey :: k -> k -> Ordering for a class that indicates consistent ordering? I mean, you control the instances. For example, you can implement instance OrdGMap map k => OrdGMap map [k] because you know what the Ord instance for lists looks like. Regards, apfelmus

Why not just use
class (Ord k, GMap map k) => OrderedGMap map k where
Most of the maps I will be implementing will respect the ordering that would be derived by GHC, regardless of the actual Ord instance. There is no way, in general, to derive a trie that respects an existing Ord instance but I still want to expose the bytewise ordering in the trie. We also already have instance Ord k => OrderedGMap OrdGT k where OrdGT use AVL trees.
The term "association list" seems to be atomic, i.e. it's not a list of associations
Its not? 'Assocs' makes sense to me but I guess its best to copy the terminology in Data.Map
focus :: k -> map a -> (Maybe a, Maybe a -> map a)
I like focus, alter and merge. According to quickcheck my serialisation code is now working so in a day or two I should be in a position to get some benchmarks set up. I would definitely prefer the simpler interface if it can be persuaded to run reasonably quickly. If not, I dont want to have a class definition that makes it difficult to write efficient implentations. I would rather have an ugly class and a nicer layer on top.
The first thing to note is that what Jamie has posted is the proposed class methods *only*. It's not the complete user level map API.
Sorry, I'll update the api this evening with the user facing code. I keep meaning to do it and then getting distracted by failed tests in this damn bit munging code. (In my head lists run left to right and machine words run right to left. This makes for some subtle bugs.) For now I'm going to rename insert as insertWith and put insert in the user api. Same for union, intersection etc

Jamie Brandon wrote:
Why not just use
class (Ord k, GMap map k) => OrderedGMap map k where
Most of the maps I will be implementing will respect the ordering that would be derived by GHC, regardless of the actual Ord instance. There is no way, in general, to derive a trie that respects an existing Ord instance but I still want to expose the bytewise ordering in the trie.
Well, you could use a newtype for the key, when necessary. But does that suffer from 1. the risk of accidentally not using a newtype? 2. is it an inconvenience to the users of the map? I can't tell... Is it a bad idea to require an Ord context even if you don't use it? It could supply a default method... but would surely confuse people like (1) above. -Isaac

Hello Jamie, Jamie Brandon wrote:
I like focus, alter and merge.
I've been thinking about the merge function. I think structurally it would be a bit like unionMaybe with mapMaybe applied to the non-overlapping bits. So it's hard to see how this could ever come close to the efficiency of the specialised union,intersection etc.. But it still looks like it could be useful in some situations (as a more efficient and flexible alternative to a union followed by a mapMaybe say), so would be worth implementing I think. ..and of course we'd probably want a version that didn't take keys as argument too.
If not, I dont want to have a class definition that makes it difficult to write efficient implentations. I would rather have an ugly class and a nicer layer on top.
Yes, we don't want to go to all this trouble to produce "efficient" maps and then throw it all away in the name of elegance. MHO too is that if efficiency requires a big ugly class definition then let it be big and ugly :-) Regards -- Adrian Hey

Hi Jamie,
thanks for doing this---I think that the functions in the API look
good. Of course, one can always think of more :-) One that I often
find useful when working with Data.Map is a function that removes a
key from a map and also returns the associated value---a kind of
destructive lookup.
I think that the library would be better if you exposed a
non-overloaded API for the map(s?) that you plan to implement. This
is useful because (i) most of the type I anticipiate using a single
type of map, and concrete types lead to more accurate types in type
inference, (ii) it makes the API Haskell'98 and hence likely to work
on most Haskell implementations out there. If you really think that
the overloading is useful, then you could provide separate modules
that provide the class+instances in terms of the non-overloaded API
and add a cabal flag to turn these on/off.
Also, as others mentioned, using Int# in an API is not great because
this is a GHC internal type. If you found situatuins where GHC is not
unboxing something please send a mail to the GHC team---they are
usually very good at fixing these sorts of things. Adrian mentioned
that you use CPP to control the use of Int# but this does not help if
the Int# leaks through the API: (i) I could still have programs that
work in Hugs and fail to compile in GHC (and vice versa), (ii) if I
want to use your map library, I would have to compile all my programs
with the CPP extensions, which is not nice.
Hope this helps,
-Iavor
On 6/16/08, Jamie Brandon
Hi, I'm writing a library for generic tries for the Summer of Code. The main point of this post is to get some feedback on the api but I'll briefly explain the idea.
The point of a trie is to exploit the recursive nature of ADTs to save on expensive key comparisons and reduce space consumption. Hinze' original formulation is very elegant but results in very deep structures and is fairly inefficient. The normal trie optimisations (concatenating singleton maps, mainly) cant be applied to the generic version.
I intend instead to encode ADTs as lists. This encoding can range from a simple walk of the ADT to creating a compressed, bit packing representation. The resulting tradeoff between encoding time and space usage should make the design fairly flexible. This will look something like:
class Serial k where
-- | Flattened form of key consists of a list of Nodes -- Node will be Int or UArray Int for compressed implementations type Node k
-- | Flatten to a list of Nodes serialise :: k -> [Node k]
-- | Reconstruct from a list of Nodes unserialise :: [Node k] -> k
The api below is mostly cribbed from Adrian Hey's initial design. Guarantees about ordering will probably vary between maps. All ascending have descending version too. Strict versions of functions will be written where appropriate, I've omitted them here for brevity. Key reconstruction is likely to be expensive so it may make more sense to seperate foldrKeys and friends into a seperate class.
Adrian has written instance of GMap for lists, UInts and Ord types so I can declare instance (Serial k) => GMap (ListGMap (Node k)) k where ... etc
I should have a spot on code.haskell.org soon, at which point I'll put up a Haddock page with the most up to date version of the api.
class GMap map where
type k
-- | The empty map. empty :: map a
-- | Create a map with a single association. singleton :: k -> a -> map a
-- | Create a map from a list of associations which /must/ be in ascending order of keys -- (with /no/ duplicate keys). If in doubt use one of the safer (but slower) 'fromAssocs' functions. fromAssocsAscending :: [(k,a)] -> map a
-- | Return 'True' if the map contains no associations. isEmpty :: map a -> Bool
-- | Return 'True' if the map contains exactly one association. isSingleton :: map a -> Bool
-- | Return the value associated with the supplied key (if any). lookup :: k -> map a -> Maybe a
-- | Insert a new association in the map if there is currently no value associated with the key. -- If there is a value associated with the key then replace it with the result of -- applying the supplied function to that value. insert :: (a -> a) -> k -> a -> map a -> map a
-- | Delete the association for the supplied key (if any). delete :: k -> map a -> map a
-- | This is a combined insert\/modify\/delete operation. The argument to the supplied function -- is ('Just' a) if there is a value (a) associated with the supplied key, otherwise 'Nothing'. -- If the return value is ('Just' a'), a' becomes the new value associated with the supplied key. -- If the return value is 'Nothing', the association for the supplied key (if any) is deleted. alter :: (Maybe a -> Maybe a) -> k -> map a -> map a
-- | Evaluate the union of two maps. If the maps contain common keys then combine the -- values associated with those keys using the supplied function. The value arguments -- to this function are supplied in the same order as the map arguments. union :: (a -> a -> a) -> map a -> map a -> map a
-- | Evaluate the intersection of two maps, combining common associations using the supplied function. intersection :: (a -> b -> c) -> map a -> map b -> map c
-- | Evaluate the difference between two maps. For any key occuring in the second map, -- the corresponding association (if any) is deleted from the first map. -- The associated values in the second map are irrelevant. difference :: map a -> map b -> map a
-- | Returns true if the keys in the first map are a subset of the keys in the second map. -- (This includes the case where the key sets are identical). Note that this function does -- not examine the associated values (which are irrelevant). See 'isSubmapOf' if you -- do want associated values examined. isSubsetOf :: map a -> map b -> Bool
-- | Returns true if the keys in the first map are a subset of the keys in the second map -- and the corresponding function always returns true when applied to the values associated -- with matching keys. isSubmapOf :: (a -> b -> Bool) -> map a -> map b -> Bool
-- | Apply the supplied function to every associated value in the map. map :: (a -> b) -> map a -> map b
-- | Apply the supplied function to every association in the map, and use the result -- as the new associated value for the corresponding key. mapWithKey :: (k -> a -> b) -> map a -> map b
-- | Delete associations for which the supplied predicate returns 'False' when applied to -- the associated value. filter :: (a -> Bool) -> map a -> map a
-- | Fold right over the list of elements in ascending order of keys. -- See 'foldrElemsAscending'' for a strict version of this function. foldrElemsAscending :: (a -> b -> b) -> map a -> b -> b
-- | Fold right over the list of keys in ascending order. -- See 'foldrKeysAscending'' for a strict version of this function. foldrKeysAscending :: (k -> b -> b) -> map a -> b -> b
-- | Fold right over the list of associations in ascending order of keys. -- See 'foldrAssocsAscending'' for a strict version of this function. foldrAssocsAscending :: (k -> a -> b -> b) -> map a -> b -> b
-- | Fold over elements in un-specified order using /unboxed/ Int accumulator (with GHC). -- Defaults to boxed Int for other Haskells. Typically used for counting functions. -- Implementations are free to traverse the map in any order. -- The folded function is always applied strictly. foldElemsUINT :: (a -> UINT -> UINT) -> map a -> UINT -> UINT
In addition there a few functions which are useful for groups of maps or nested maps.
-- | Add the number of associations in a map to the supplied /unboxed/ Int (with GHC). -- Defaults to boxed Int for other Haskells. addSize :: map a -> UINT -> UINT
-- | Find the value associated with the supplied key (if any) and return the result -- of applying the supplied continuation function to that value. Useful for nested lookup. lookupCont :: (a -> Maybe b) -> k -> map a -> Maybe b
-- | Reject empty maps (return Nothing). nonEmpty :: map a -> Maybe (map a)
The following functions are useful internally as most of the maps are defined in terms of simpler maps. Also see this thread on the need for unionMaybe.
-- | Similar to 'insert', but the association is deleted if the supplied function returns 'Nothing'. -- (The supplied function is always applied strictly.) insertMaybe :: (a -> Maybe a) -> k -> a -> map a -> map a
-- | Find the value associated with the supplied key (if any) and apply the supplied function -- to that value. Delete the association if the result is 'Nothing'. Replace the old value with -- the new value if the result is ('Just' something). -- (The supplied function is always applied strictly.) deleteMaybe :: (a -> Maybe a) -> k -> map a -> map a
-- | Evaluate the union of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) unionMaybe :: (a -> a -> Maybe a) -> map a -> map a -> map a
-- | Evaluate the intersection of two maps, but delete combined associations from the result map -- if the combining function returns 'Nothing'. -- (The combining function is always applied strictly.) intersectionMaybe :: (a -> b -> Maybe c) -> map a -> map b -> map c
-- | Difference with a combining function. If the combining function returns -- @Just a@ then the corresponding association is not deleted from the result map -- (it is retained with @a@ as the associated value). differenceMaybe :: (a -> b -> Maybe a) -> map a -> map b -> map a
-- | Apply the supplied function to every associated value in the map. -- If the result is 'Nothing' then the delete the corresponding association. -- (The supplied function is always applied strictly.) mapMaybe :: (a -> Maybe b) -> map a -> map b
Thanks
Jamie _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Hello Iavor, Iavor Diatchki wrote:
thanks for doing this---I think that the functions in the API look good. Of course, one can always think of more :-) One that I often find useful when working with Data.Map is a function that removes a key from a map and also returns the associated value---a kind of destructive lookup.
That would be good. I guess you could implement that using the OMap idea I just posted, or you could have a specialised lookupDelete method.
I think that the library would be better if you exposed a non-overloaded API for the map(s?) that you plan to implement. This is useful because (i) most of the type I anticipiate using a single type of map, and concrete types lead to more accurate types in type inference, (ii) it makes the API Haskell'98 and hence likely to work on most Haskell implementations out there. If you really think that the overloading is useful, then you could provide separate modules that provide the class+instances in terms of the non-overloaded API and add a cabal flag to turn these on/off.
I agree, all the hand written instances I've written so far do expose the non-overloaded function names. The instance declarations are just simple renamings. However, the polymorphic versions still have appropriate class constraints so probably still won't work with any Haskell that doesn't understand the class. Things may not be so simple with automated deriving because of the difficulty of adding to module export lists. But I guess with bit of that cpp hackery thrown in this is not too much of a problem :-)
Also, as others mentioned, using Int# in an API is not great because this is a GHC internal type. If you found situatuins where GHC is not unboxing something please send a mail to the GHC team---they are usually very good at fixing these sorts of things. Adrian mentioned that you use CPP to control the use of Int# but this does not help if the Int# leaks through the API: (i) I could still have programs that work in Hugs and fail to compile in GHC (and vice versa), (ii) if I want to use your map library, I would have to compile all my programs with the CPP extensions, which is not nice.
Well I may be wrong, but AFAIK as we're talking about a class method here if boxed is what you specify, boxed is what you will get. It *might* be converted to unboxed if you use a SPECIALISE pragma, but this kind of optimisation depends on strictness analysis. It's very easy to end up writing something that is non-strict (or is strict but ghc can't see it) and lose the unboxing. I haven't been keeping up with H', but I would hope that support for unboxed (or should I say unpointed?) types would be part of that. Regards -- Adrian Hey

Also, as others mentioned, using Int# in an API is not great because this is a GHC internal type. If you found situatuins where GHC is not unboxing something please send a mail to the GHC team---they are usually very good at fixing these sorts of things. Adrian mentioned that you use CPP to control the use of Int# but this does not help if the Int# leaks through the API: (i) I could still have programs that work in Hugs and fail to compile in GHC (and vice versa), (ii) if I want to use your map library, I would have to compile all my programs with the CPP extensions, which is not nice.
Well I may be wrong, but AFAIK as we're talking about a class method here if boxed is what you specify, boxed is what you will get. It *might* be converted to unboxed if you use a SPECIALISE pragma, but this kind of optimisation depends on strictness analysis. It's very easy to end up writing something that is non-strict (or is strict but ghc can't see it) and lose the unboxing.
this is where we need, hmm, if in the class there was size :: map a -> Int# one would like something such as size :: map a -> {-# UNPACK #-} Int to work, since that unboxing doesn't change the semantics (results are implicitly lifted...). But as it is addSize :: map a -> UINT -> UINT foldElemsUINT :: (a -> UINT -> UINT) -> map a -> UINT -> UINT we also need something like John Meacham's ! and ~ stuff in class methods, to force strictness (because that changes the semantics). It could get ugly, since GHC tries to preserve sharing by default, so we always need UNPACK: addSize :: map a -> {-#UNPACK#-}!Int -> {-#UNPACK#-}Int Reminds me a little of http://hackage.haskell.org/trac/ghc/ticket/1349 because functions in data and in classes suffer the same "openness" problem that you can't know, given separate compilation, what functions are going to be used to implement that - so you can't optimize for them so well. -Isaac

On Sat, 21 Jun 2008, Iavor Diatchki wrote:
Hi Jamie,
thanks for doing this---I think that the functions in the API look good. Of course, one can always think of more :-) One that I often find useful when working with Data.Map is a function that removes a key from a map and also returns the associated value---a kind of destructive lookup.
I think that the library would be better if you exposed a non-overloaded API for the map(s?) that you plan to implement. This is useful because (i) most of the type I anticipiate using a single type of map, and concrete types lead to more accurate types in type inference, (ii) it makes the API Haskell'98 and hence likely to work on most Haskell implementations out there. If you really think that the overloading is useful, then you could provide separate modules that provide the class+instances in terms of the non-overloaded API and add a cabal flag to turn these on/off.
That's certainly not a good idea, since other modules must be sure, that if package generic-tries is installed that then the advanced API is available. Thus the advanced API must be moved to a separate package. (I prefer that separation.)

Ive added the rest of Adrian's functions to the api, largely without change. Still living at http://code.haskell.org/gmap/api/GMap.html The Serial code is now *finished* and tested and is living at http://code.haskell.org/gmap/serial . The user level interface is largely the same as Data.Binary, on which it is based: *Data.Serial> encode "supercalifragilisticexpialidociousness" :: [Word] [1401017574,815320646,2442711660,441247251,3504761142,212218697,3261675550,3863169456,1865717965,1399966954,6082150] I'll get to work on some benchmark now so I can see how efficiently the simple version of the api can be implemented. It doesnt really seem worth arguing over it until we have some numbers. I would hope that if merge and the associated comparison functions are declared inlinable then ghc should be able to mostly recover the original, efficient implementations. Cheers Jamie
participants (9)
-
Adrian Hey
-
apfelmus
-
Bas van Dijk
-
Henning Thielemann
-
Ian Lynagh
-
Iavor Diatchki
-
Isaac Dupree
-
Jamie Brandon
-
Jeffrey Yasskin