Re: Proposal for Data.Graph : Improve graph creation complexity when nodes have "consecutive" keys

(Sending this back to the libraries@ list as well.)
On 2 April 2018 at 23:59, Olivier S.
2018-04-02 14:34 GMT+02:00 Ivan Lazar Miljenovic
: On 2 April 2018 at 21:30, Olivier S.
wrote: Hello,
I'm resending this proposal, which is simplified w.r.t the first one, and where I removed a wrong analysis of a benchmark.
- Proposal I : Optimize the time complexity of (key -> Maybe Vertex) lookups and graph creation when keys are Integral and consecutive.
(The related PR for proposal I is [1], including benchmarks showing the performance improvements.)
Currently, (key -> Maybe Vertex) lookups returned by graphFromEdges consist of a binary search on an array, with a time complexity of O(log V) (I will use V for "Count of vertices", E for "Count of edges").
At the risk of bikeshedding, can you please use |V| and |E| to refer to the order and size of the graph respectively?
Do you mean in the haddock documentation for complexities or here? I don't know which is mor readable, O( (V+E) * log V ) or O( (|V|+|E|) * log |V| ). Anyway it would be a quick change in the PR, I'm not particularly attached to the notation.
Both. |V| and |E| are more standard for this, as V and E represent the vertices and edges themselves.
When key is Integral, and keys (of nodes passed to the graph creation function) form a set of /consecutive/ values (for example : [4,5,6,7] or [5,6,4,7]), we can have an O(1) lookup by substracting the value of the smallest key, and checking for bounds.
I'm not sure I follow this part; are you ignoring order in these lists (you're referring to sets but using list notation)?
I'm not ignoring the order, let me try to give a more precise definition:
keys is a list of consecutive keys iff it verifies:
-- (1) keys contains no duplicates Set.size (Set.fromList keys) == length keys
-- (2) there is no "gap" between values, when sorted: sort keys == [minimum keys .. maximum keys]
The O(1) lookup is at line 516 of Data/Graph.hs in https://github.com/haskell/containers/pull/549/files (key_vertex)
Hence, graph creation complexity is improved, and user algorithms using (key -> Maybe Vertex) lookups will see their complexity reduced by a factor of up-to O(log V).
The PR introduces this lookup and uses it for functions graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys.
Here is a summary of complexities for (graph creation, lookup function) as they stand in the current state of the PR:
- graphFromEdges (the currently existing function): O( (V+E) * log V ), O(log V) - graphFromEdgesWithConsecutiveKeys (new function): O( E + (V*log V) ), O(1) - graphFromEdgesWithConsecutiveAscKeys (new function) : O( V+E ), O(1)
- Proposal II : Deprecate `graphFromEdges` taking [(node, key, [key])] in favor of `graphFromMap` taking (Map key (node,[key]))
If we pass the same key twice in the list we pass to 'graphFromEdges' it is undefined which node for that key will actually be used. Introducing 'graphFromMap', taking a (Map key (node,[key]) would alleviate this issue, through the type used.
Off the top of my head, I'm not a big fan of this. If we're going to improve this, then I'd prefer to do so in such a way that allowed for usage with IntMap
Yes, IntMap seems to be better wrt performances than Map. Quoting the doc of IntMap:
This data structure performs especially well on binary operations like union and intersection. However, my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).
(is there an existing type-class that covers association list-style data structures?).
There is the Map type-class (which I just discovered) in :
https://hackage.haskell.org/package/collections-api-1.0.0.0/docs/Data-Collec...
Except that it's in another library ;-)
With instances defined here, but only for Lazy versions Data.Map and Data.IntMap:
Note that the data structures for the Lazy and Strict variants of [Int]Map are the same, it's just the strictness of the functions that operate on them that differ.
https://hackage.haskell.org/package/collections-base-instances-1.0.0.0/docs/...
Also, the type-class class doesn't have toAscList (or toList) functions (which is what we would use in the implementation).
So if we want to rely on this we would need to implement toAscList, and probably add instances for Strict maps (Data.IntMap.Strict, Data.Map.Strict)
Ideally you could also use HashMap from unordered-containers as well, but since we ultimately want `type Vertex = Int` I'm not sure if that's worth it; IntMap, however, is.
I see another problem with HashMap : it doesn't provide a toAscList function where the keys are sorted, so we would have to sort them, incurring a fixed O(V log V) cost, whereas with Map and IntMap the user has the possibility to create the map from an ascending list (fromAscList), in O(V) time and we can get the list back also (toAscList) in O(V) time.
Also, using a Map makes the implementation a bit more "natural" : there is no need for sorting by key, as Map.toAscList gives exactly the sorted list we want.
We could also deprecate graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys (introduced in proposal I) in favor of graphFromConsecutiveMap.
About the naming, I propose two different schemes:
Either: - graphFromEdges (takes a List, deprecated, existing function) - graphFromEdgesInMap (takes a Map) - graphFromEdgesInConsecutiveMap (takes a Map with consecutive keys) Or: - graphFromEdges (takes a List, deprecated, existing function) - graphFromMap - graphFromConsecutiveMap with these, to reflect the Map / List duality in the naming scheme: - graphFromList (takes a List, deprecated, redirects to graphFromEdges) - graphFromConsecutiveList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveKeys) - graphFromConsecutiveAscList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveAscKeys)
Cheers, Olivier Sohn
[1] https://github.com/haskell/containers/pull/549
_______________________________________________ Libraries mailing list Libraries@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

2018-04-03 0:18 GMT+02:00 Ivan Lazar Miljenovic
(Sending this back to the libraries@ list as well.)
On 2 April 2018 at 23:59, Olivier S.
wrote: 2018-04-02 14:34 GMT+02:00 Ivan Lazar Miljenovic
: On 2 April 2018 at 21:30, Olivier S.
wrote: Hello,
I'm resending this proposal, which is simplified w.r.t the first one, and where I removed a wrong analysis of a benchmark.
- Proposal I : Optimize the time complexity of (key -> Maybe Vertex) lookups and graph creation when keys are Integral and consecutive.
(The related PR for proposal I is [1], including benchmarks showing
the
performance improvements.)
Currently, (key -> Maybe Vertex) lookups returned by graphFromEdges consist of a binary search on an array, with a time complexity of O(log V) (I will use V for "Count of vertices", E for "Count of edges").
At the risk of bikeshedding, can you please use |V| and |E| to refer to the order and size of the graph respectively?
Do you mean in the haddock documentation for complexities or here? I don't know which is mor readable, O( (V+E) * log V ) or O( (|V|+|E|) * log |V| ). Anyway it would be a quick change in the PR, I'm not particularly attached to the notation.
Both. |V| and |E| are more standard for this, as V and E represent the vertices and edges themselves.
When key is Integral, and keys (of nodes passed to the graph creation function) form a set of /consecutive/ values (for example : [4,5,6,7] or [5,6,4,7]), we can have an O(1) lookup by substracting the value of the smallest key, and checking for bounds.
I'm not sure I follow this part; are you ignoring order in these lists (you're referring to sets but using list notation)?
I'm not ignoring the order, let me try to give a more precise definition:
keys is a list of consecutive keys iff it verifies:
-- (1) keys contains no duplicates Set.size (Set.fromList keys) == length keys
-- (2) there is no "gap" between values, when sorted: sort keys == [minimum keys .. maximum keys]
The O(1) lookup is at line 516 of Data/Graph.hs in https://github.com/haskell/containers/pull/549/files (key_vertex)
Hence, graph creation complexity is improved, and user algorithms
using
(key -> Maybe Vertex) lookups will see their complexity reduced by a factor of up-to O(log V).
The PR introduces this lookup and uses it for functions graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys.
Here is a summary of complexities for (graph creation, lookup function) as they stand in the current state of the PR:
- graphFromEdges (the currently existing function): O( (V+E) * log V ), O(log V) - graphFromEdgesWithConsecutiveKeys (new function): O( E + (V*log V) ), O(1) - graphFromEdgesWithConsecutiveAscKeys (new function) : O( V+E ), O(1)
- Proposal II : Deprecate `graphFromEdges` taking [(node, key, [key])] in favor of `graphFromMap` taking (Map key (node,[key]))
If we pass the same key twice in the list we pass to 'graphFromEdges' it is undefined which node for that key will actually be used. Introducing 'graphFromMap', taking a (Map key (node,[key]) would alleviate this issue, through the type used.
Off the top of my head, I'm not a big fan of this. If we're going to improve this, then I'd prefer to do so in such a way that allowed for usage with IntMap
Yes, IntMap seems to be better wrt performances than Map. Quoting the doc of IntMap:
This data structure performs especially well on binary operations like union and intersection. However, my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).
(is there an existing type-class that covers association list-style data structures?).
There is the Map type-class (which I just discovered) in :
https://hackage.haskell.org/package/collections-api-1.0.0. 0/docs/Data-Collections.html#g:4
Except that it's in another library ;-)
So it seems using Data.IntMap would be a good compromise?
With instances defined here, but only for Lazy versions Data.Map and Data.IntMap:
Note that the data structures for the Lazy and Strict variants of [Int]Map are the same, it's just the strictness of the functions that operate on them that differ.
That's interesting, I wasn't aware of this.
instances-1.0.0.0/docs/Data-Collections-BaseInstances.html
Also, the type-class class doesn't have toAscList (or toList) functions (which is what we would use in the implementation).
So if we want to rely on this we would need to implement toAscList, and probably add instances for Strict maps (Data.IntMap.Strict,
Data.Map.Strict)
Ideally you could also use HashMap from unordered-containers as well, but since we ultimately want `type Vertex = Int` I'm not sure if that's worth it; IntMap, however, is.
I see another problem with HashMap : it doesn't provide a toAscList
where the keys are sorted, so we would have to sort them, incurring a fixed O(V log V) cost, whereas with Map and IntMap the user has the
create the map from an ascending list (fromAscList), in O(V) time and we can get the list back also (toAscList) in O(V) time.
Also, using a Map makes the implementation a bit more "natural" :
function possibility to there
is no need for sorting by key, as Map.toAscList gives exactly the sorted list we want.
We could also deprecate graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys (introduced in proposal I) in favor of graphFromConsecutiveMap.
About the naming, I propose two different schemes:
Either: - graphFromEdges (takes a List, deprecated, existing function) - graphFromEdgesInMap (takes a Map) - graphFromEdgesInConsecutiveMap (takes a Map with consecutive keys) Or: - graphFromEdges (takes a List, deprecated, existing function) - graphFromMap - graphFromConsecutiveMap with these, to reflect the Map / List duality in the naming scheme: - graphFromList (takes a List, deprecated, redirects to graphFromEdges) - graphFromConsecutiveList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveKeys) - graphFromConsecutiveAscList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveAscKeys)
Cheers, Olivier Sohn
[1] https://github.com/haskell/containers/pull/549
_______________________________________________ Libraries mailing list Libraries@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

On 3 April 2018 at 09:24, Olivier S.
So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality. My preferences are: * A typeclass (unfortunately Foldable for a Map is only on the values, not on the Key; unless you provide a wrapper?) * Lists (despite the issues you've pointed out, people can always convert Maps, IntMaps, etc. to lists to convert those values to a Graph) * Map -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

2018-04-03 1:51 GMT+02:00 Ivan Lazar Miljenovic
On 3 April 2018 at 09:24, Olivier S.
wrote: So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality.
I was under the impression that we can replace [(node, key, [key])], by IntMap (node, [Int]), node being anything we want. Is it not true?
My preferences are:
* A typeclass (unfortunately Foldable for a Map is only on the values, not on the Key; unless you provide a wrapper?) * Lists (despite the issues you've pointed out, people can always convert Maps, IntMaps, etc. to lists to convert those values to a Graph) * Map
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

On 3 April 2018 at 10:11, Olivier S.
2018-04-03 1:51 GMT+02:00 Ivan Lazar Miljenovic
: On 3 April 2018 at 09:24, Olivier S.
wrote: So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality.
I was under the impression that we can replace [(node, key, [key])], by IntMap (node, [Int]), node being anything we want. Is it not true?
I've never used Data.Graph in anger, but my understanding is that in the most polymorphic sense you may consider the equivalent of `Map node [key]` along with a `node -> key` function. For example: `data MyNode key = MyNode { nodeID :: key, edges :: [key]}`; then you could have `graphFromEdges . map (\mn -> (mn, nodeID mn, edges mn))`. At this point in time the actual type of `node` is no longer useful, so having `graphFromEdges` consume a `Map key [key]` may suffice... but then you can't get back the original node type to map back to your original values. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

I think in your example, nodes are `()` by the current meaning of node in
graphFromEdges.
To be closer to what graphFromEdges does today, we should change it to:
`data MyNode key nodeData = MyNode nodeData key [key]`
2018-04-03 2:54 GMT+02:00 Ivan Lazar Miljenovic
On 3 April 2018 at 10:11, Olivier S.
wrote: 2018-04-03 1:51 GMT+02:00 Ivan Lazar Miljenovic <
ivan.miljenovic@gmail.com>:
On 3 April 2018 at 09:24, Olivier S.
wrote: So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality.
I was under the impression that we can replace [(node, key, [key])], by IntMap (node, [Int]), node being anything we want. Is it not true?
I've never used Data.Graph in anger, but my understanding is that in the most polymorphic sense you may consider the equivalent of `Map node [key]` along with a `node -> key` function. For example: `data MyNode key = MyNode { nodeID :: key, edges :: [key]}`; then you could have `graphFromEdges . map (\mn -> (mn, nodeID mn, edges mn))`.
At this point in time the actual type of `node` is no longer useful, so having `graphFromEdges` consume a `Map key [key]` may suffice... but then you can't get back the original node type to map back to your original values.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

On Tue, 3 Apr. 2018, 5:31 pm Olivier S.,
I think in your example, nodes are `()` by the current meaning of node in graphFromEdges.
To be closer to what graphFromEdges does today, we should change it to:
`data MyNode key nodeData = MyNode nodeData key [key]`
No, `node ~ MyNode` in my example. Whilst the key may be the unique identifier, the node type itself may be richer (and contain the key!).
2018-04-03 2:54 GMT+02:00 Ivan Lazar Miljenovic
:
On 3 April 2018 at 10:11, Olivier S.
wrote: 2018-04-03 1:51 GMT+02:00 Ivan Lazar Miljenovic <
ivan.miljenovic@gmail.com>:
On 3 April 2018 at 09:24, Olivier S.
wrote: So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality.
I was under the impression that we can replace [(node, key, [key])], by IntMap (node, [Int]), node being anything we want. Is it not true?
I've never used Data.Graph in anger, but my understanding is that in the most polymorphic sense you may consider the equivalent of `Map node [key]` along with a `node -> key` function. For example: `data MyNode key = MyNode { nodeID :: key, edges :: [key]}`; then you could have `graphFromEdges . map (\mn -> (mn, nodeID mn, edges mn))`.
At this point in time the actual type of `node` is no longer useful, so having `graphFromEdges` consume a `Map key [key]` may suffice... but then you can't get back the original node type to map back to your original values.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com
--
Ivan Miljenovic On mobile; please excuse any tpyos

2018-04-03 9:50 GMT+02:00 Ivan Lazar Miljenovic
On Tue, 3 Apr. 2018, 5:31 pm Olivier S.,
wrote: I think in your example, nodes are `()` by the current meaning of node in graphFromEdges.
To be closer to what graphFromEdges does today, we should change it to:
`data MyNode key nodeData = MyNode nodeData key [key]`
No, `node ~ MyNode` in my example. Whilst the key may be the unique identifier, the node type itself may be richer (and contain the key!).
Just wanted to point out that if the node contains some data which is neither key not list of neighbours, then `Map key [key]` won't be enough.
2018-04-03 2:54 GMT+02:00 Ivan Lazar Miljenovic < ivan.miljenovic@gmail.com>:
On 3 April 2018 at 10:11, Olivier S.
wrote: 2018-04-03 1:51 GMT+02:00 Ivan Lazar Miljenovic <
ivan.miljenovic@gmail.com>:
On 3 April 2018 at 09:24, Olivier S.
wrote: So it seems using Data.IntMap would be a good compromise?
IntMap only works if `node ~ Int`; otherwise we lose generality.
I was under the impression that we can replace [(node, key, [key])], by IntMap (node, [Int]), node being anything we want. Is it not true?
I've never used Data.Graph in anger, but my understanding is that in the most polymorphic sense you may consider the equivalent of `Map node [key]` along with a `node -> key` function. For example: `data MyNode key = MyNode { nodeID :: key, edges :: [key]}`; then you could have `graphFromEdges . map (\mn -> (mn, nodeID mn, edges mn))`.
At this point in time the actual type of `node` is no longer useful, so having `graphFromEdges` consume a `Map key [key]` may suffice... but then you can't get back the original node type to map back to your original values.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com
--
Ivan Miljenovic On mobile; please excuse any tpyos
participants (2)
-
Ivan Lazar Miljenovic
-
Olivier S.