can there be (hash-table using) O(n) version of this (I think currently) n log n algo?

The code below is, I think, n log n, a few seconds on a million + element list. I wonder if it's possible to get this down to O(N) by using a hashtable implemementation, or other better data structure. Further, is there a hashtable implementation for haskell that doesn't live in IO? Maybe in ST or something? import Data.HashTable import qualified Data.Set as S import Data.List (foldl') testdata = [1,4,8,9,20,11,20,14,2,15] ++ [1..(10^6)] wantedsum = 29 -- set data structure -- findsums locates pairs of integers in a list that add up to a wanted sum. findsums :: [Int] -> Int -> S.Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (candidates,successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes) -- hashtable data structure -- result: t -- fromList [(15,14),(16,13),(17,12),(18,11),(19,10),(20,9),(21,8),(22,7),(23,6),(24,5),(25,4),(26,3),(27,2),(28,1)] -- probably O(n log n) complexity since using tree based Data.Set (a few seconds on million+ element list) t = findsums testdata wantedsum

Haskell hash tables are a notorious performance pig, mostly due to the fact
that when we deal with big arrays, if the mutable array changes at all the
garbage collector will have to retraverse the entire thing during the next
collection. Guess the most common scenario for imperative hash tables that
are even lightly tweaked from time to time... ;)
As for other non-IO hash tables, I've seen a couple of unboxed hash tables
using STUArrays (which can side step this issue for unboxable data), IIRC
one may have even been used for a language shootout problem. I even wrote (a
rather poorly performing) Witold Litwin-style sorted linear hash table for
STM a couple of years back (it should still be on hackage under 'thash').
Data.HashTable could be easily reimplemented in ST s, but it would still
suffer the same GC problems as the current hash table, which no one likes.
-Ed
On Fri, Jul 17, 2009 at 6:24 PM, Thomas Hartman
The code below is, I think, n log n, a few seconds on a million + element list.
I wonder if it's possible to get this down to O(N) by using a hashtable implemementation, or other better data structure.
Further, is there a hashtable implementation for haskell that doesn't live in IO? Maybe in ST or something?
import Data.HashTable import qualified Data.Set as S import Data.List (foldl')
testdata = [1,4,8,9,20,11,20,14,2,15] ++ [1..(10^6)] wantedsum = 29
-- set data structure -- findsums locates pairs of integers in a list that add up to a wanted sum. findsums :: [Int] -> Int -> S.Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (candidates,successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes)
-- hashtable data structure
-- result: t -- fromList [(15,14),(16,13),(17,12),(18,11),(19,10),(20,9),(21,8),(22,7),(23,6),(24,5),(25,4),(26,3),(27,2),(28,1)] -- probably O(n log n) complexity since using tree based Data.Set (a few seconds on million+ element list) t = findsums testdata wantedsum _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

However you can use the wider idea of hashing: A nesting of two finite maps. One fast, but approximative map. And one slow, but exact map. The quintessential example is an array indexed with some hash function for the first map. And linked lists of (key,value) pairs as the latter. In Haskell you might want to use IntMap and a the mentioned list of pairs (combined with the lookup functions from Data.List). Of course you need to supply a function to hash your keys to Int for the IntMap.

Hello Thomas, Saturday, July 18, 2009, 2:24:21 AM, you wrote:
Further, is there a hashtable implementation for haskell that doesn't live in IO? Maybe in ST or something?
import Prelude hiding (lookup) import qualified Data.HashTable import Data.Array import qualified Data.List as List data HT a b = HT (a->Int) (Array Int [(a,b)]) -- size is the size of array (we implement a closed hash) -- hash is the hash function (a->Int) -- list is assoclist of items to put in hash create size hash list = HT hashfunc (accumArray (flip (:)) [] (0, arrsize-1) (map (\(a,b) -> (hashfunc a,(a,b))) list) ) where arrsize = head$ filter (>size)$ iterate (\x->3*x+1) 1 hashfunc a = hash a `mod` arrsize lookup a (HT hash arr) = List.lookup a (arr!hash a) main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)] hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist print (lookup "one" hash) print (lookup "zero" hash) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Thanks Bulat.
FWIW, i take it that
http://www.haskell.org/haskellwiki/Shootout/Knucleotide
is what Edward was referring to, with the shootouts. It seems that a
lot of progress has been made but not much has been migrated back to
hackage.
Going back to my original question, I am now looking for a dead simple
motivating example for showing the example of using a (good) hashtable
over Data.Map, with a tangible demo of O(n) over O(n log n) running
times. I mean, something where running an input of (10^4) size versus
(10^6) size shows a noticeably laggier run when using Set versus
hashtable.
I don't think maybe my original example quite qualifies because I
think in practice the computation is dominated by space complexity.
However, I haven't yet ported it over to a hashtable version, so not
sure.
(And the shootout example doesn't satisfy my sense of "dead simple.")
2009/7/18 Bulat Ziganshin
Hello Thomas,
Saturday, July 18, 2009, 2:24:21 AM, you wrote:
Further, is there a hashtable implementation for haskell that doesn't live in IO? Maybe in ST or something?
import Prelude hiding (lookup) import qualified Data.HashTable import Data.Array import qualified Data.List as List
data HT a b = HT (a->Int) (Array Int [(a,b)])
-- size is the size of array (we implement a closed hash) -- hash is the hash function (a->Int) -- list is assoclist of items to put in hash create size hash list = HT hashfunc (accumArray (flip (:)) [] (0, arrsize-1) (map (\(a,b) -> (hashfunc a,(a,b))) list) )
where arrsize = head$ filter (>size)$ iterate (\x->3*x+1) 1 hashfunc a = hash a `mod` arrsize
lookup a (HT hash arr) = List.lookup a (arr!hash a)
main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)] hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist print (lookup "one" hash) print (lookup "zero" hash)
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Thomas, Saturday, July 18, 2009, 7:23:10 PM, you wrote:
Going back to my original question, I am now looking for a dead simple motivating example for showing the example of using a (good) hashtable over Data.Map
spell checking? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Any application where multiple updates are done in multiple threads . gain
by using a hashTable
2009/7/18 Bulat Ziganshin
Hello Thomas,
Saturday, July 18, 2009, 7:23:10 PM, you wrote:
Going back to my original question, I am now looking for a dead simple motivating example for showing the example of using a (good) hashtable over Data.Map
spell checking?
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

tphyahoo:
The code below is, I think, n log n, a few seconds on a million + element list.
Have you tried the judy arrays library on Hackage? (It provides a hashtable, which I've used occasionally) -- Don

Thomas Hartman wrote:
The code below is, I think, n log n, a few seconds on a million + element list.
I wonder if it's possible to get this down to O(N) by using a hashtable implemementation, or other better data structure.
-- findsums locates pairs of integers in a list that add up to a wanted sum.
findsums :: [Int] -> Int -> S.Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (candidates,successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes)
Remember that hash tables are actually O(key length) instead of O(1), so I don't think you can break the log n for really large lists this way since the key length increases as well (unless most elements are equal anyway). In any case, I have this conjecture On which I will lecture: A Set of some things Is sorting for kings. ;-) findsums goal = uncurry sweep . (id &&& reverse) . sort where sweep [] _ = [] sweep _ [] = [] sweep (x:xs) (y:ys) = if x > y then [] else case compare (x+y) goal of LT -> sweep xs (y:ys) EQ -> (x,y) : sweep xs ys GT -> sweep (x:xs) ys This algorithm needs a proof of correctness, though. And it's longer that the Data.Set version, too. Regards, apfelmus -- http://apfelmus.nfshost.com

On Sunday 19 July 2009 09:26:14 Heinrich Apfelmus wrote:
Thomas Hartman wrote:
The code below is, I think, n log n, a few seconds on a million + element list.
I wonder if it's possible to get this down to O(N) by using a hashtable implemementation, or other better data structure.
-- findsums locates pairs of integers in a list that add up to a wanted sum.
findsums :: [Int] -> Int -> S.Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (candidates,successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes)
Remember that hash tables are actually O(key length) instead of O(1), so I don't think you can break the log n for really large lists this way since the key length increases as well (unless most elements are equal anyway).
Use a trie of hash tables with ~word-sized pieces of key. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e

Never underestimate teh power of the Int{Set,Map}: {-# LANGUAGE BangPatterns #-} import Data.Set(Set) import Data.IntSet(IntSet) import qualified Data.Set as S import qualified Data.IntSet as IS import Control.Parallel.Strategies(rnf) import Data.Monoid(Monoid(..)) import Data.List findsumsIS :: [Int] -> Int -> IntSet findsumsIS xs wanted = snd . foldl' f mempty $ xs where f (!candidates,!successes) next = let x = wanted - next in case x `IS.member` candidates of True -> (candidates, IS.insert next successes) False -> (IS.insert next candidates,successes) -- (i had to add bangs in f since it was blowing the stack) findsums :: [Int] -> Int -> Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (!candidates,!successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes) {- Note that the list has 10 million elements, (time is roughly 0.4s with 1 million with IntSet). -} {- main = do let s = findsums (take 10000000 (cycle [1..999])) 500 print (rnf s `seq` ()) [m@monire ~]$ time ./FindSums () real 0m8.793s user 0m8.762s sys 0m0.022s -} {- main = do let s = findsumsIS (take 10000000 (cycle [1..999])) 500 print (rnf s `seq` ()) [m@monire ~]$ time ./FindSumsIS () real 0m4.488s user 0m4.459s sys 0m0.023s -} Matt
On Sunday 19 July 2009 09:26:14 Heinrich Apfelmus wrote:
Thomas Hartman wrote:
The code below is, I think, n log n, a few seconds on a million + element list.
I wonder if it's possible to get this down to O(N) by using a hashtable implemementation, or other better data structure.
-- findsums locates pairs of integers in a list that add up to a wanted sum.
findsums :: [Int] -> Int -> S.Set (Int,Int) findsums xs wanted = snd . foldl' f (S.empty,S.empty) $ xs where f (candidates,successes) next = if S.member (wanted-next) candidates then (candidates, S.insert (next,wanted-next) successes) else (S.insert next candidates,successes)
Remember that hash tables are actually O(key length) instead of O(1), so I don't think you can break the log n for really large lists this way since the key length increases as well (unless most elements are equal anyway).
participants (9)
-
Alberto G. Corona
-
Bulat Ziganshin
-
Don Stewart
-
Edward Kmett
-
Heinrich Apfelmus
-
Jon Harrop
-
Matt Morrow
-
Matthias Görgens
-
Thomas Hartman