is 256M RAM insufficient for a 20 million element Int/Int map?

{-# LANGUAGE BangPatterns #-} import qualified Data.Map as M import Debug.Trace {- I'm trying to run a HAppS web site with a large amount of data: stress testing happstutorial.com. Well, 20 million records doesn't sound that large by today's standards, but anyway that's my goal for now. I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci, on a computer with 256M of ram. I'm wondering if there is a data structure that might be more suitable for large recordsets. Or do you just have to use a database, or some sort of file-based serialization, once your records are in the millions? Or is this some weird subtlety of lazy evalution, or some other haskell gotcha? -} size = 2 * 10^7 -- out of memory error t = (M.! size) . myFromList . map (\i->(i,i)) $ [1..size] -- Lists are no problem {- *Main> :! time ghc -e tL testMap.hs (20000000,20000000) 3.38user 0.09system 0:03.53elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k -} tL = (!! (size-1)) . map (\i->(i,i)) $ [1..size] t2 = (M.fromList . map (\i->(i,i)) $ [1..10] ) M.\\ (M.fromList . map (\i->(i,i)) $ [6..15]) -- does this evaluate all of list l, or just whnf? myFromList (!l) = M.fromList l

On Sat, 2008-10-18 at 22:26 +0200, Thomas Hartman wrote:
{-# LANGUAGE BangPatterns #-} import qualified Data.Map as M import Debug.Trace {- I'm trying to run a HAppS web site with a large amount of data: stress testing happstutorial.com. Well, 20 million records doesn't sound that large by today's standards, but anyway that's my goal for now. I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci, on a computer with 256M of ram. I'm wondering if there is a data structure that might be more suitable for large recordsets. Or do you just have to use a database, or some sort of file-based serialization, once your records are in the millions? Or is this some weird subtlety of lazy evalution, or some other haskell gotcha? -}
In GHC, a linked list is about 12 bytes per cons (on a 32-bit computer). Let's say you had a linked list of Ints, 20 million elements long. That's 240 million bytes. Data.Map probably has a higher per element memory cost. You could probably use/make a data structure that stores the data more densely, but even a flat array of 20 million Ints on a 32-bit machine is approximately 80MB, on a 64-bit machine 160MB. Note, that by today's standards 256MB of memory is no memory at all.
size = 2 * 10^7
-- out of memory error t = (M.! size) . myFromList . map (\i->(i,i)) $ [1..size]
-- Lists are no problem {- *Main> :! time ghc -e tL testMap.hs (20000000,20000000) 3.38user 0.09system 0:03.53elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k -} tL = (!! (size-1)) . map (\i->(i,i)) $ [1..size]
t2 = (M.fromList . map (\i->(i,i)) $ [1..10] ) M.\\ (M.fromList . map (\i->(i,i)) $ [6..15])
-- does this evaluate all of list l, or just whnf? myFromList (!l) = M.fromList l
tL can garbage collect the list as it goes along and runs in constant memory due to laziness. It may even be deforested leading to no heap allocation at all. As I mentioned above, all of a 20 million element long list probably wouldn't fit in (physical) memory on your computer. In t2's case, the entire Map is built and in memory.

Hello Derek, Sunday, October 19, 2008, 12:40:43 AM, you wrote:
In GHC, a linked list is about 12 bytes per cons (on a 32-bit computer).
this isn't a whole picture. due to Garbage Collection, actual memory use is 2-3 times higher (although you can make this coef. as close to 1 as you want by coinfiguring more frequent GCs), so 20 million elems list will really consume 20*30 = 600 mb -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

tphyahoo:
{-# LANGUAGE BangPatterns #-} import qualified Data.Map as M import Debug.Trace {- I'm trying to run a HAppS web site with a large amount of data: stress testing happstutorial.com. Well, 20 million records doesn't sound that large by today's standards, but anyway that's my goal for now. I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci,
Int keys, Int values eh? Does using IntMap help? If not, trying using a UArray or hash structure? -- Don

Don Stewart wrote:
tphyahoo:
I'm trying to run a HAppS web site with a large amount of data: stress testing happstutorial.com. Well, 20 million records doesn't sound that large by today's standards, but anyway that's my goal for now. I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci,
Int keys, Int values eh?
Does using IntMap help?
Interesting. Map Int Int, IntMap Int and [(Int, Int)] use pretty much the same amount of memory, assuming that no keys or values are shared: Map Int Int: data Map k a = Tip | Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a) - all 'Tip's are shared. - For each (key, value) pair, there is one 'Bin', needing tag[*] (1 word) size (unpacked, 1 word) key, value (pointer, tag, value = 3 words each) two more pointers (1 word each) for a total of 10 words per element. IntMap Int: data IntMap a = Nil | Tip {-# UNPACK #-} !Key a | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(IntMap a) !(IntMap a) - one 'Tip' per element: tag (1 word) + key (1 word) + value (3 words) - one 'Bin' per element (minus 1): tag (1 word) + prefix, mask (1 word each) + 2 pointers (1 word each) for a total of 10 words per element. [(Int, Int)]: - one '(:)' per element: - tag (1 word) + 2 pointers (1 word each) - one '(Int, Int)' per element: - tag (1 word) + key (3 words) + value (3 words) for a total of 10 words per element, again. Now if we want to save memory, we can specialise Data.Map to Int keys and values, saving 4 words per element, and encode the external leaves (Tips) in the constructor, saving another word: data IntIntMap = Tip | BinBB !Size !Int !Int IntIntMap IntIntMap | BinBT !Size !Int !Int IntIntMap | BinTB !Size !Int !Int IntIntMap | BinTT !Size !Int !Int -- sprinkle {-# UNPACK #-} as needed That's 5 words per elements. Would that be worthwhile? Bertram [*] actually, the info pointer in the Spineless, Tagless G-machine.

I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci, Int keys, Int values eh? Does using IntMap help? Interesting. Map Int Int, IntMap Int and [(Int, Int)] use pretty much the same amount of memory, assuming that no keys or values are shared:
I haven't followed the thread, but one thing that keeps tripping me up wrt memory use is that Maps aren't strict in their values by default. Worse, if that is not what you want for your application, working around it can be rather awkward (Data.Map at least provides insertWith', but there's no unionWith', and Data.IntMap doesn't even have insertWith'). Ideally, I'd just like to indicate the strictness in the types (hint to ghc hackers: 'Data.Map.Map Int !Int' and '[!a]' would really be useful!-), but as that isn't supported, separate operations (with seq inserted manually at suitable places) seem necessary (hint to container library maintainers: please provide full set of strict operation variants!-). If this is (or contributes to) the problem, it should show up in heap profiles as an upward ramp and could be fixed by defining and using locally augmented versions of the Map operations. Claus

On Sun, Oct 19, 2008 at 4:26 AM, Claus Reinke
(hint to ghc hackers: 'Data.Map.Map Int !Int' and '[!a]' would really be useful!-),
I can't figure out what that means though. Strictness is not a property of types or of values, it is a property of functions. [!] is not a subtype of [] ; IOW, there is no a such that [a] = [!Int] (where [!Int] is a list with strict values). For example, if we allowed this, the following property breaks: length xs == length (map f xs) Since it is not true on strict lists. Luke

(hint to ghc hackers: 'Data.Map.Map Int !Int' and '[!a]' would really be useful!-),
I can't figure out what that means though. Strictness is not a property of types or of values, it is a property of functions. [!] is not a subtype of [] ; IOW, there is no a such that [a] = [!Int] (where [!Int] is a list with strict values). For example, if we allowed this, the following property breaks:
length xs == length (map f xs)
Since it is not true on strict lists.
I'm not entirely sure this couldn't be worked out. Let's say that every type breaks into terminating and non-terminating computations: a = !a | ^a For your equation to break, you're assuming implicit conversions between some 'a' and '!a', ie, something like (undefined :: a) :: !a (const undefined :: a -> b) :: !a -> !a But those shouldn't typecheck! We can't decide termination, so not all objects of type 'a' can be classified into the subtypes '!a' or '^a'. Membership in '!a' can be constructive only, and map for strict lists would have a type somewhat like this mapS :: !(!a -> !a) -> [!a] -> [!a] Without a function, 'mapS' can't construct an element-strict result list, hence the first '!' (and 'mapS undefined' won't typecheck). Nor can it do so without a function that can construct (or pass on) strict elements, hence the '!' on the result type of the parameter function (so 'mapS (const undefined)' won't typecheck, either). Since there are no ways of producing an arbitrary '!a' out of thin air, polymorphic parameter functions will have type '!a->!a' (if 'a' gets specialised, eg to 'Int', then trivial 'Int->!Int' functions are possible). Being conjured out of thin air, none of this might not hold up under closer scrutiny, but papers like [1] suggest that it isn't entirely out of reach - further references or counterexamples appreciated!-) Claus [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.8984 Unboxing using specialisation, Simon L. Peyton Jones and Patrick M. Sansom

On Sunday 19 October 2008 10:32:08 am Claus Reinke wrote:
(hint to ghc hackers: 'Data.Map.Map Int !Int' and '[!a]' would really be useful!-),
I can't figure out what that means though. Strictness is not a property of types or of values, it is a property of functions. [!] is not a subtype of [] ; IOW, there is no a such that [a] = [!Int] (where [!Int] is a list with strict values). For example, if we allowed this, the following property breaks:
length xs == length (map f xs)
Since it is not true on strict lists.
I'm not entirely sure this couldn't be worked out. Let's say that every type breaks into terminating and non-terminating computations:
a = !a | ^a
For your equation to break, you're assuming implicit conversions between some 'a' and '!a', ie, something like
(undefined :: a) :: !a (const undefined :: a -> b) :: !a -> !a
But those shouldn't typecheck! We can't decide termination, so not all objects of type 'a' can be classified into the subtypes '!a' or '^a'. Membership in '!a' can be constructive only, and map for strict lists would have a type somewhat like this
mapS :: !(!a -> !a) -> [!a] -> [!a]
Without a function, 'mapS' can't construct an element-strict result list, hence the first '!' (and 'mapS undefined' won't typecheck). Nor can it do so without a function that can construct (or pass on) strict elements, hence the '!' on the result type of the parameter function (so 'mapS (const undefined)' won't typecheck, either). Since there are no ways of producing an arbitrary '!a' out of thin air, polymorphic parameter functions will have type '!a->!a' (if 'a' gets specialised, eg to 'Int', then trivial 'Int->!Int' functions are possible).
Being conjured out of thin air, none of this might not hold up under closer scrutiny, but papers like [1] suggest that it isn't entirely out of reach - further references or counterexamples appreciated!-)
I don't think there's any realistic way to keep his equation from breaking. For instance, you might be able to prevent someone from writing: f :: !a -> !a f _ = undefined However, that still doesn't guarantee that you can't write non-terminating functions, which are semantically bottom: f :: !a -> !a f a = f a Note, that you can even write non-termination for unboxed types: n :: Int# n = n Even though such values may be supposedly unlifted. So, given that second f, clearly length (map f xs) = _|_ when length xs > 1. Total languages go through a great deal of effort to prevent this sort of thing. For instance, many use syntactic checks on the functions your write to ensure that they are structurally recursive. So, perhaps you could add these checks to a Haskell compiler, but you'd also have to ensure that you don't produce total values from non-total values, because doing structural recursion over coinductive values (like infinite lists) introduces bottom. And finally, you'd have to do similar checks on Haskell's data types, because the unrestricted recursive types that Haskell has allow you to introduce bottom without explicit recursion: newtype Wrap a = Roll { unRoll :: Wrap a -> a } -- (\x -> x x) (\x -> x x) omega :: a omega = w (Roll w) where w x = unRoll x x The solution most total languages take is to make sure that all such recursive types are strictly positive (recursive uses of the type may not be to the left of an arrow; Wrap clearly fails this check). Anyhow, it seems to me that to adequately separate all this stuff, and prevent things like 'length xs = length (mapS f xs)' from breaking, you'd end up with an entirely separate total sublanguage that doesn't interact much with the existing part (which kind of ruins the appeal of introducing such annotations in the first place). Perhaps I'm missing something, though. Cheers, Dan

claus.reinke:
I have a standard Data.Map.Map as the base structure for one of my macid data tables (jobs), but I noticed something that is probably causing problems for me. Even a simple 20 million record with int/int key values causes an out of memory error for me in ghci, Int keys, Int values eh? Does using IntMap help? Interesting. Map Int Int, IntMap Int and [(Int, Int)] use pretty much the same amount of memory, assuming that no keys or values are shared:
I haven't followed the thread, but one thing that keeps tripping me up wrt memory use is that Maps aren't strict in their values by default.
Worse, if that is not what you want for your application, working around it can be rather awkward (Data.Map at least provides insertWith', but there's no unionWith', and Data.IntMap doesn't even have insertWith').
Ideally, I'd just like to indicate the strictness in the types (hint to ghc hackers: 'Data.Map.Map Int !Int' and '[!a]' would really be useful!-), but as that isn't supported, separate operations (with seq inserted manually at suitable places) seem necessary (hint to container library maintainers: please provide full set of strict operation variants!-).
If this is (or contributes to) the problem, it should show up in heap profiles as an upward ramp and could be fixed by defining and using locally augmented versions of the Map operations.
I'd like them strict and specialised, So that: data IntMap a = Nil | Tip {-# UNPACK #-} !Key a | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(IntMap a) !(IntMap a) applied as so, type T = IntMap {-# UNPACK #-} !Int would be equivalent to data IntMapT = Nil | Tip {-# UNPACK #-} !Key {-# UNPACK #-} !Int | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(IntMap a) !(IntMap a) where we've avoided an indirection in the Tip nodes. Less space, faster access. In general, being able to specialise polymorphic structures so they look like unpacked monomorphic ones would be awesome. (!Int, !Bool) -> (,) {-# UNPACK #-}!Int {-# UNPACK #-}!Bool -- Don

On Sun, Oct 19, 2008 at 5:05 PM, Don Stewart
In general, being able to specialise polymorphic structures so they look like unpacked monomorphic ones would be awesome.
(!Int, !Bool) -> (,) {-# UNPACK #-}!Int {-# UNPACK #-}!Bool
I repeat my concern about this notation and the implications thereof. (!Int, !Bool) cannot be passed to a function accepting (a,b). However, I feel there's something very useful here that should be fleshed out rather than hacked. For example, a theory of composable strict structures in a lazy language. Luke

2008/10/20 Luke Palmer
On Sun, Oct 19, 2008 at 5:05 PM, Don Stewart
wrote: In general, being able to specialise polymorphic structures so they look like unpacked monomorphic ones would be awesome.
(!Int, !Bool) -> (,) {-# UNPACK #-}!Int {-# UNPACK #-}!Bool
I repeat my concern about this notation and the implications thereof. (!Int, !Bool) cannot be passed to a function accepting (a,b).
However, I feel there's something very useful here that should be fleshed out rather than hacked. For example, a theory of composable strict structures in a lazy language.
This paper might be of interest to you, assuming you haven't seen it: http://research.microsoft.com/~simonpj/papers/not-not-ml/index.htm. It doesn't really deal with the issue of UNPACKed data structures, but does show the first steps towards making sense of types like [!Int] vs [Int] (if we consider !Int as an "ML-style Int" (albeit a boxed one) and Int as the vanilla Haskell flavour). Cheers, Max

On Sun, 2008-10-19 at 16:05 -0700, Don Stewart wrote:
I'd like them strict and specialised,
So that:
data IntMap a = Nil | Tip {-# UNPACK #-} !Key a | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(IntMap a) !(IntMap a)
applied as so,
type T = IntMap {-# UNPACK #-} !Int
Yes, except that it should be newtype. That gives a proper boundary to apply injection / projection functions for the different internal representation. Duncan

Hello Bertram, Sunday, October 19, 2008, 6:19:31 AM, you wrote:
That's 5 words per elements
... that, like everything else, should be multiplied by 2-3 to account GC effect -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sun, 19 Oct 2008, Bulat Ziganshin wrote:
Hello Bertram,
Sunday, October 19, 2008, 6:19:31 AM, you wrote:
That's 5 words per elements
... that, like everything else, should be multiplied by 2-3 to account GC effect
Unless I'm much mistaken, that isn't the case when you're looking at the minimum heap size because the GC'll run more frequently when you hit the max heap size supported anyway, no? The 2-3 is relevant when asking if it'll go at all fast, or how much memory'll get eaten if it's there, sure. -- flippa@flippac.org Society does not owe people jobs. Society owes it to itself to find people jobs.

Hello Philippa, Sunday, October 19, 2008, 3:25:26 PM, you wrote:
... that, like everything else, should be multiplied by 2-3 to account GC effect
Unless I'm much mistaken, that isn't the case when you're looking at the minimum heap size because the GC'll run more frequently when you hit the max heap size supported anyway, no?
what you mean? max heap size is 2gb probably. it may be configured on cmdline and if you will enable say 200 mb heap and your program use only 180 mb - it will run successfully, using only 200 mb of memory. drawback, of course, is that it may become 10x slower -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sun, 19 Oct 2008, Bulat Ziganshin wrote:
Hello Philippa,
Sunday, October 19, 2008, 3:25:26 PM, you wrote:
... that, like everything else, should be multiplied by 2-3 to account GC effect
Unless I'm much mistaken, that isn't the case when you're looking at the minimum heap size because the GC'll run more frequently when you hit the max heap size supported anyway, no?
what you mean? max heap size is 2gb probably. it may be configured on cmdline and if you will enable say 200 mb heap and your program use only 180 mb - it will run successfully, using only 200 mb of memory. drawback, of course, is that it may become 10x slower
Ah, so you can't trust GHC to pick a max heap size within what the OS actually has available? That does make the RTS option rather necessary. But well worth knowing if you're trying to make something run in a known footprint. -- flippa@flippac.org The task of the academic is not to scale great intellectual mountains, but to flatten them.

Hello Philippa, Sunday, October 19, 2008, 3:58:35 PM, you wrote:
what you mean? max heap size is 2gb probably. it may be configured on
Ah, so you can't trust GHC to pick a max heap size within what the OS actually has available?
hm, this includes virtual memory too. there are code snippets that
limits heap to, say, 80% of RAM:
/* after a tip from David Roundy */
#include

Bulat Ziganshin wrote:
Hello Bertram,
Sunday, October 19, 2008, 6:19:31 AM, you wrote:
That's 5 words per elements
... that, like everything else, should be multiplied by 2-3 to account GC effect
True. You can control this factor though. Two RTS options help: -c (Enable compaction for all major collections) - mostly avoids fragmentation in the old generation. -F<factor> (Control the amount of memory reserved in terms of the size of the oldest generation. The default is 2, meaning that if the oldest generation is 200MB in size, 400 MB of heap will be used) Consider this program,
module Main (main) where
import qualified Data.IntMap as M import Data.List (foldl')
main = do loop (M.fromList [(i,0) | i <- [1..5000000]]) 1
loop dict j = do i <- readLn print $ dict M.! (i :: Int) let dict' = foldl' (\m (k, v) -> M.insert k v m) dict [(,) i $! j*i | i <- [j`mod`10 * 500000 + 1..j`mod`10 * 500000 + 500000]] loop dict' (j+1)
This program maintains an IntMap with 5 million entries, which means 200 MB of live data on a 32 bit computer. It updates the map a lot, too, so I think this is a fairly realistic example. Running it on a 51 line input with various RTS options [*], we get: Options Memory used Time used +RTS -c -F1.1 220 MB 3m22s +RTS -c -F1.2 243 MB 2m12s +RTS -c -F1.5 306 MB 1m58s +RTS -c 398 MB 1m57s +RTS -F 1.1 406 MB 1m43s +RTS -F 1.2 425 MB 1m15s +RTS -F 1.5 483 MB 1m6s none 580 MB 1m11s Heap residency was around 200.5 million bytes in all runs. As expected, saving memory this way doesn't come cheap - it can dramatically increase the program's runtime. But if a program builds and slowly updates a large dictionary, playing with these options can help a lot. Bertram [*] time (seq 50; echo 0) | ./Main +RTS -sstderr -c -F1.2
participants (11)
-
Bertram Felgenhauer
-
Bulat Ziganshin
-
Claus Reinke
-
Dan Doel
-
Derek Elkins
-
Don Stewart
-
Duncan Coutts
-
Luke Palmer
-
Max Bolingbroke
-
Philippa Cowderoy
-
Thomas Hartman