Grouping - Map / Reduce

Hi, let say I got an unordered lazy list of key/value pairs like [('a', 99), ('x', 42), ('a', 33) ... ] and I need to sum up all the values with the same keys. So far I wrote a naive implementation, using Data.Map, foldl and insertWith. The result of this grouping operation, which is effectively another list of key/value pairs, just sums this time, needs to be further processed. The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again. Is there another way of doing this, something more "streaming architecture" like? Is Googles "Map - Reduce" related to this? Günther

On Tue, Mar 24, 2009 at 3:15 PM, Gü?nther Schmidt
Hi,
let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
The result of this grouping operation, which is effectively another list of key/value pairs, just sums this time, needs to be further processed.
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Is there another way of doing this, something more "streaming architecture" like?
Yeah, make a trie. Here's a quick example. import Data.Monoid newtype IntTrie a = IntTrie [a] singleton :: (Monoid a) => Int -> a -> IntTrie a singleton ch x = IntTrie [ if fromIntegral ch == i then x else mempty | i <- [0..] ] lookupTrie :: IntTrie a -> Int -> a lookupTrie (IntTrie xs) n = xs !! n instance (Monoid a) => Monoid (IntTrie a) where mempty = IntTrie (repeat mempty) mappend (IntTrie xs) (IntTrie ys) = IntTrie (infZipWith mappend xs ys) infZipWith f ~(x:xs) ~(y:ys) = f x y : infZipWith f xs ys test = mconcat [ singleton (n `mod` 42) [n] | n <- [0..] ] `lookupTrie` 10 This is an inefficient way to find the class of n such that n mod 42 = 10. Note that it works on an infinite list of inputs. Here the "trie" was a simple list, but you could replace it with a more advanced data structure for better performace. Luke

On Tue, Mar 24, 2009 at 3:51 PM, Luke Palmer
On Tue, Mar 24, 2009 at 3:15 PM, Gü?nther Schmidt
wrote: Hi,
let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
The result of this grouping operation, which is effectively another list of key/value pairs, just sums this time, needs to be further processed.
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Is there another way of doing this, something more "streaming architecture" like?
Yeah, make a trie. Here's a quick example.
import Data.Monoid
newtype IntTrie a = IntTrie [a]
singleton :: (Monoid a) => Int -> a -> IntTrie a singleton ch x = IntTrie [ if fromIntegral ch == i then x else mempty | i <- [0..] ]
This definition of singleton unnecessarily leaks memory in some cases. Here's a better one: singleton ch x = IntTrie $ replicate ch mempty ++ [x] ++ repeat mempty Luke
lookupTrie :: IntTrie a -> Int -> a lookupTrie (IntTrie xs) n = xs !! n
instance (Monoid a) => Monoid (IntTrie a) where mempty = IntTrie (repeat mempty) mappend (IntTrie xs) (IntTrie ys) = IntTrie (infZipWith mappend xs ys)
infZipWith f ~(x:xs) ~(y:ys) = f x y : infZipWith f xs ys
test = mconcat [ singleton (n `mod` 42) [n] | n <- [0..] ] `lookupTrie` 10
This is an inefficient way to find the class of n such that n mod 42 = 10. Note that it works on an infinite list of inputs.
Here the "trie" was a simple list, but you could replace it with a more advanced data structure for better performace.
Luke

Dear Luke, I suspect Black Magic at work here. This seems to work and I so don't have a clue why. But thank you very much nevertheless, I strongly suspect that, once I figured out why this works, I will have learned a very, very important trick indeed. Had I read "purely functional data structures" from start to finish, would I have come across this? Günther Luke Palmer schrieb:
On Tue, Mar 24, 2009 at 3:51 PM, Luke Palmer
mailto:lrpalmer@gmail.com> wrote: On Tue, Mar 24, 2009 at 3:15 PM, Gü?nther Schmidt
mailto:gue.schmidt@web.de> wrote: Hi,
let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith..
The result of this grouping operation, which is effectively another list of key/value pairs, just sums this time, needs to be further processed.
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Is there another way of doing this, something more "streaming architecture" like?
Yeah, make a trie. Here's a quick example.
import Data.Monoid
newtype IntTrie a = IntTrie [a]
singleton :: (Monoid a) => Int -> a -> IntTrie a singleton ch x = IntTrie [ if fromIntegral ch == i then x else mempty | i <- [0..] ]
This definition of singleton unnecessarily leaks memory in some cases. Here's a better one:
singleton ch x = IntTrie $ replicate ch mempty ++ [x] ++ repeat mempty
Luke
lookupTrie :: IntTrie a -> Int -> a lookupTrie (IntTrie xs) n = xs !! n
instance (Monoid a) => Monoid (IntTrie a) where mempty = IntTrie (repeat mempty) mappend (IntTrie xs) (IntTrie ys) = IntTrie (infZipWith mappend xs ys)
infZipWith f ~(x:xs) ~(y:ys) = f x y : infZipWith f xs ys
test = mconcat [ singleton (n `mod` 42) [n] | n <- [0..] ] `lookupTrie` 10
This is an inefficient way to find the class of n such that n mod 42 = 10. Note that it works on an infinite list of inputs.
Here the "trie" was a simple list, but you could replace it with a more advanced data structure for better performace.
Luke
------------------------------------------------------------------------
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Dear Luke, I'm at a loss trying to figure out what is happening here, I'd sincerely appreciate it if you could find the time to give me more clues on this, I'm deeply impressed. If I think that what is happening *is* happening that would mean that this is a way to group in almost constant space. Günther

Luke Palmer wrote:
On Tue, Mar 24, 2009 at 3:15 PM, Gü?nther Schmidt
wrote: Hi,
let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
The result of this grouping operation, which is effectively another list of key/value pairs, just sums this time, needs to be further processed.
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Is there another way of doing this, something more "streaming architecture" like?
Yeah, make a trie. Here's a quick example.
Nice! There was a thread about this question a few years ago, with some very interesting developments like the "blueprint technique" by Bertram Felgenhauer. http://thread.gmane.org/gmane.comp.lang.haskell.cafe/15135 Regards, apfelmus -- http://apfelmus.nfshost.com

On Thu, 26 Mar 2009, Heinrich Apfelmus wrote:
Luke Palmer wrote:
Yeah, make a trie. Here's a quick example.
Nice!
There was a thread about this question a few years ago, with some very interesting developments like the "blueprint technique" by Bertram Felgenhauer.
I remember that this thread yielded a Wiki page, which seems to be gone together with Hawiki. I have at least started a page on the new Wiki in order to preserve the link to that discussion: http://haskell.org/haskellwiki/Blueprint

"Gü?nther Schmidt"
let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
Data.Map.fromListWith (+)
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Sure this is not an artifact of the laziness of foldl?
Is there another way of doing this, something more "streaming architecture" like?
I don't see how you can do this much better - for a small, fixed set of keys, you could use an (STU) array for the sums, but it depends if the added complexity is worth it. You're already doing a single pass over the data. -k -- If I haven't seen further, it is by standing in the footprints of giants

Hi Ketil, Ketil Malde schrieb:
"Gü?nther Schmidt"
writes: let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
Data.Map.fromListWith (+)
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Sure this is not an artifact of the laziness of foldl?
well I can't really see how the map could be consumed *while* it's still being built, I just don't see it. (I'm using foldl' and insertWith', sry for not saying so initially).
Is there another way of doing this, something more "streaming architecture" like?
I don't see how you can do this much better - for a small, fixed set of keys, you could use an (STU) array for the sums, but it depends if the added complexity is worth it. You're already doing a single pass over the data.
-k

Dear Günther, the map can't be consumed while it is constructed. At any point during its construction you don't know for any key in the map if it will appear in the not cosumed rest of the list again. This means you can't process any entry of the map because it may change later. The only point when nothing will change anymore is when the map is completely constructed. Regards, Martin. Günther Schmidt schrieb:
Hi Ketil,
Ketil Malde schrieb:
"Gü?nther Schmidt"
writes: let say I got an unordered lazy list of key/value pairs like
[('a', 99), ('x', 42), ('a', 33) ... ]
and I need to sum up all the values with the same keys.
So far I wrote a naive implementation, using Data.Map, foldl and insertWith.
Data.Map.fromListWith (+)
The building of this map is of course a bottleneck, the successive processing needs to wait until the entire list is eventually consumed the Map is built and flattened again.
Sure this is not an artifact of the laziness of foldl?
well I can't really see how the map could be consumed *while* it's still being built, I just don't see it. (I'm using foldl' and insertWith', sry for not saying so initially).
Is there another way of doing this, something more "streaming architecture" like?
I don't see how you can do this much better - for a small, fixed set of keys, you could use an (STU) array for the sums, but it depends if the added complexity is worth it. You're already doing a single pass over the data.
-k
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I'm also learning Haskell so the solution below might be (1) inefficient and (2) incorrect, but hey, let's give it a try :-) For simplicity, in the testing code, I assume an infinite list of key/value pairs where they keys are of type Char between 'a' and 'z' and the values are Integers (the code also seems to work for keys with just a lower bound but no upper bound) I think the code speaks for itself import System.Random runningSumsOfValuesPerKey :: (Eq k, Num v) => [k] -> [(k, v)] -> [[v]] runningSumsOfValuesPerKey allPossibleKeys = runningSums . allValuesPerKey where runningSums = map (scanl (+) 0) allValuesPerKey pairs = [ valuesWithKey key pairs | key <- allPossibleKeys ] valuesWithKey key = map snd . filter ((==key) . fst) -- Testing randomPairs :: [(Char, Integer)] randomPairs = zip keys values where keys = randomRs ('a','z') rnd1 values = randomRs (0,9) rnd2 (rnd1,rnd2) = split (mkStdGen 0) test = map (take 10) [rs `atKey` 'c', rs `atKey` 'z'] where rs = runningSumsOfValuesPerKey ['a'..] randomPairs xs `atKey` k = xs !! (fromEnum k - fromEnum 'a')
participants (8)
-
Gü?nther Schmidt
-
Günther Schmidt
-
Heinrich Apfelmus
-
Henning Thielemann
-
Ketil Malde
-
Luke Palmer
-
Martin Huschenbett
-
Peter Verswyvelen