Trying to reduce memory costs of String duplicates

Hi all, I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings. As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead. Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element? Günther

Should be easy to implement one. Something like this:
class (Monad m) => MonadIntern e m | e -> m where
intern :: e -> m e
instance (Ord e) => MonadIntern e (State (M.Map e e)) where
intern = modify . insertWith (\old new -> old))
2009/9/5 Günther Schmidt
Hi all,
I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings.
As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead.
Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element?
Günther
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Web IR developer, market.yandex.ru

2009/9/5 Eugene Kirpichov
Should be easy to implement one. Something like this:
class (Monad m) => MonadIntern e m | e -> m where intern :: e -> m e
instance (Ord e) => MonadIntern e (State (M.Map e e)) where intern = modify . insertWith (\old new -> old))
I mean, intern e = (modify . insertWith const $ e) >> (fromJust . (`lookup`e)) `fmap` get. However, that probably also won't compile, but I think you get the idea.
2009/9/5 Günther Schmidt
: Hi all,
I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings.
As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead.
Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element?
Günther
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Web IR developer, market.yandex.ru
-- Eugene Kirpichov Web IR developer, market.yandex.ru

2009/9/5 Günther Schmidt
Hi all,
I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings.
As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead.
Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element?
I believe a memoization of the identity function will do what you want: import qualified Data.MemoCombinators as Memo share = Memo.list Memo.char id Then pass any string through share to make/get a cached version. You might want to limit the scope of share -- eg. put it in a where clause for the function where you're using it -- so that it doesn't eat memory for the lifetime of your program, only for when you need it. Luke

Hi Luke,
thanks, this is some very good advice as I find many duplicates in the
data I have to iterate over.
However so far I'm unable to tell wether this actually works or not, I
tried it a couple of times under different settings but it showed to
difference in memory consumption. The same mem peeks as before.
Do you have some code that where you could see a before and after?
Günther
Am 05.09.2009, 19:38 Uhr, schrieb Luke Palmer
2009/9/5 Günther Schmidt
: Hi all,
I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings.
As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead.
Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element?
I believe a memoization of the identity function will do what you want:
import qualified Data.MemoCombinators as Memo
share = Memo.list Memo.char id
Then pass any string through share to make/get a cached version.
You might want to limit the scope of share -- eg. put it in a where clause for the function where you're using it -- so that it doesn't eat memory for the lifetime of your program, only for when you need it.
Luke

On Sun, Sep 06, 2009 at 01:35:08AM +0200, Günther Schmidt wrote:
However so far I'm unable to tell wether this actually works or not, I tried it a couple of times under different settings but it showed to difference in memory consumption. The same mem peeks as before.
You may want to use 'vacuum', it shows how data is internally represented on GHC, including sharing. -- Felipe.

Günther Schmidt wrote:
Hi all,
I'm reading in a data of 216k records into a map of Key, Values Pairs, the values being strings.
As it happens out of 216k String values there really are only about 6.6k distinct string values, so I could save a lot of RAM if I was able to "insert" only actually *new* string values into the map and use references to (string) values that already are in memory instead.
Is there a container that would, if I wanted to insert an element, return a pair of either the previously inserted, equal value and the container unchanged, or the new, previously unknown value and the new container amended by that element?
If by "strings" you allow ByteStrings, then you could use the bytestring-trie package[1]. This will be more worthwhile than other approaches if your 6.6k strings have a lot of repeated prefixes, since repeated prefixes will be shared among the unique strings. Something like the following should work: intern :: ByteString -> Trie ByteString -> (ByteString, Trie ByteString) intern s t = (fromJust (lookup s t'), t') where t' = alterBy (\_ _ -> maybe (Just s) Just) s s t [1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie -- Live well, ~wren
participants (5)
-
Eugene Kirpichov
-
Felipe Lessa
-
Günther Schmidt
-
Luke Palmer
-
wren ng thornton