
On Sat, 2008-10-18 at 22:26 +0200, Thomas Hartman wrote:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Map as M
import Debug.Trace

{- I'm trying to run a HAppS web site with a large amount of data:
   stress testing happstutorial.com. Well, 20 million records doesn't
   sound that large by today's standards, but anyway that's my goal
   for now. I have a standard Data.Map.Map as the base structure for
   one of my macid data tables (jobs), but I noticed something that
   is probably causing problems for me. Even a simple 20-million-record
   map with Int/Int keys and values causes an out-of-memory error for
   me in ghci, on a computer with 256M of RAM. I'm wondering if there
   is a data structure that might be more suitable for large record
   sets. Or do you just have to use a database, or some sort of
   file-based serialization, once your records are in the millions?
   Or is this some weird subtlety of lazy evaluation, or some other
   Haskell gotcha? -}
In GHC, a linked list costs about 12 bytes per cons cell on a 32-bit machine (one header word plus two pointer fields). Say you had a linked list of Ints, 20 million elements long: that's 240 million bytes for the spine alone, and each boxed Int adds roughly another 8 bytes on top of that. Data.Map almost certainly has a higher per-element cost: each internal node carries a size field, a key, a value, and two child pointers, so with boxed Int keys and values you're looking at something like 40 bytes per entry, on the order of 800MB for 20 million entries on a 32-bit machine. You could probably use or make a data structure that stores the data more densely; even a flat array of 20 million Ints is approximately 80MB on a 32-bit machine, 160MB on a 64-bit one. Note that, by today's standards, 256MB of memory is no memory at all.
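If the keys really are dense Ints, an unboxed array is one way to get that flat representation (Data.IntMap is also worth a look if you need map semantics for Int keys). A minimal sketch of the array approach, mine rather than from the thread:

import Data.Array.Unboxed (UArray, listArray, (!))

size :: Int
size = 2 * 10^7

-- 20 million Ints stored flat: one machine word per element,
-- with no per-element boxing and no tree nodes.
dense :: UArray Int Int
dense = listArray (1, size) [1 .. size]

main :: IO ()
main = print (dense ! size)   -- prints 20000000

Of course this only works when the key space is a contiguous range; for sparse keys you're back to a map-like structure.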
size = 2 * 10^7
-- out of memory error
t = (M.! size) . myFromList . map (\i->(i,i)) $ [1..size]
-- Lists are no problem
{- *Main> :! time ghc -e tL testMap.hs
   (20000000,20000000)
   3.38user 0.09system 0:03.53elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k -}
tL = (!! (size-1)) . map (\i->(i,i)) $ [1..size]
t2 = (M.fromList . map (\i->(i,i)) $ [1..10]) M.\\ (M.fromList . map (\i->(i,i)) $ [6..15])
-- does this evaluate all of list l, or just whnf?
myFromList (!l) = M.fromList l
tL can garbage-collect the list as it goes along and runs in constant memory thanks to laziness; it may even be deforested, leading to no heap allocation at all. As I mentioned above, a complete 20-million-element list probably wouldn't fit in (physical) memory on your computer anyway. In t's case, by contrast, the entire Map is built and resident in memory before the lookup can happen. (And to answer the question in myFromList: a bang pattern only forces the list to WHNF, i.e. the first cons cell; it's M.fromList itself that walks and evaluates the whole list.)
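A quick sketch of that WHNF point (mine, not from the thread): forcing a list with a bang pattern evaluates only the outermost constructor, so the tail stays unevaluated until something else demands it.

{-# LANGUAGE BangPatterns #-}

-- Forcing to WHNF stops at the first cons cell: the undefined tail
-- is never touched, so this prints 1 instead of crashing.
whnfDemo :: Int
whnfDemo = let !l = 1 : undefined in head l

main :: IO ()
main = print whnfDemo

So the bang in myFromList buys essentially nothing here; the whole map gets materialized simply because M.fromList consumes the entire list to build the tree.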