Space questions about intern and sets

Hello all, I've got two questions, both are space related so I'll put them in one e-mail. 1. I'd like to have function intern, that behaves essentially like id, but with the following constraint: if x==y then makeStableName(intern x)==makeStableName(intern y) Reasoning behind this hack: objects equal (as in Eq) should occupy the same place in memory. I've got to parse quite large file, most tokens are the same. Haskell speed is very good, but I constantly run out of memory. Here is something I wrote, but it doesn't work :( intern :: Ord a => a -> a intern x = unsafePerformIO $ internIO x iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty {-# NOINLINE iorefset #-} internIO :: Ord a => a -> IO a internIO x = do myset <- readIORef iorefset case Map.lookup x myset of Just y -> do return y Nothing -> do let newset = Map.insert x x myset writeIORef iorefset newset return $ fromJust $ Map.lookup x newset --return x iorefset is re-executed, but it shouldn't. 2. Second question: imagine a Data.Set.Set and Data.Set.map with some function f, that is mostly identity, only some, rare elements are changed. As I understand it, map essentially copies whole structure of set and this leads to out-of-memory conditions for me. So, basically I have space problems. Is there any FAQ out there "how to make my program fit in 256MB of heap?" -- Gracjan

Gracjan Polak
intern :: Ord a => a -> a intern x = unsafePerformIO $ internIO x
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
It will not work because you can't put values of different types as keys of the same dictionary, as you can't compare them. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
Gracjan Polak
writes: intern :: Ord a => a -> a intern x = unsafePerformIO $ internIO x
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
It will not work because you can't put values of different types as keys of the same dictionary, as you can't compare them.
I could have as many dictionaries as there are types. The problem is I get one dictionary for each object which defeats the idea. Is there any other way to safe some memory when having many same objects? -- Gracjan

On 2005 June 02 Thursday 04:38, Gracjan Polak wrote:
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
I could have as many dictionaries as there are types. The problem is I get one dictionary for each object which defeats the idea.
To avoid unsafe operations and get control over the dictionaries that are created, I would put the desired dictionaries into a state monad. The type of 'intern' becomes Ord a => a -> DictionaryState a All the code that uses 'intern' would need some modification to deal more directly with the dictionary state. It may be more complex, but it's also more solid.

Scott Turner wrote:
On 2005 June 02 Thursday 04:38, Gracjan Polak wrote:
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
I could have as many dictionaries as there are types. The problem is I get one dictionary for each object which defeats the idea.
To avoid unsafe operations and get control over the dictionaries that are created, I would put the desired dictionaries into a state monad. The type of 'intern' becomes Ord a => a -> DictionaryState a All the code that uses 'intern' would need some modification to deal more directly with the dictionary state. It may be more complex, but it's also more solid.
As intern behaves like id and does not have any side effects, I thought its interface should be purely functional. But I do not see any way to do it :( I'll end up with a monad, probably. In related question: does anybody here have experience/benchmarks/tests how/if is PackedString better (uses less memory) than String in parsing tasks? -- Gracjan

On Fri, 2005-06-03 at 10:53 +0200, Gracjan Polak wrote:
As intern behaves like id and does not have any side effects, I thought its interface should be purely functional. But I do not see any way to do it :( I'll end up with a monad, probably.
In related question: does anybody here have experience/benchmarks/tests how/if is PackedString better (uses less memory) than String in parsing tasks?
GHC itself uses a rather low level thing it calls FastString which is basically a pointer into a character array with a length and a unique id. The unique ids are allocated by entering each FastString into a global hash table which also provides sharing if the same string is seen more than once (like your itern feature). It is all very low level and ghc-specific however and probably only makes sence in a compiler-like application. Duncan

On Fri, Jun 03, 2005 at 04:02:09PM +0100, Duncan Coutts wrote:
On Fri, 2005-06-03 at 10:53 +0200, Gracjan Polak wrote:
As intern behaves like id and does not have any side effects, I thought its interface should be purely functional. But I do not see any way to do it :( I'll end up with a monad, probably.
In related question: does anybody here have experience/benchmarks/tests how/if is PackedString better (uses less memory) than String in parsing tasks?
GHC itself uses a rather low level thing it calls FastString which is basically a pointer into a character array with a length and a unique id. The unique ids are allocated by entering each FastString into a global hash table which also provides sharing if the same string is seen more than once (like your itern feature).
It is all very low level and ghc-specific however and probably only makes sence in a compiler-like application.
jhc has something very similar in its Atom and PackedString modules. The advantages are that it always stores strings in UTF8 so the type is a CPR type rather than a union and hence can be optimized much better. (in particular it can be {-# UNPACK #-}ed. I have not done any formal comparasons though. darcs also has its own similar thing which I believe is faster but uses FFI calls to C code rather than beping pure ghc-haskell. John -- John Meacham - ⑆repetae.net⑆john⑈

Duncan Coutts wrote:
On Fri, 2005-06-03 at 10:53 +0200, Gracjan Polak wrote:
As intern behaves like id and does not have any side effects, I thought its interface should be purely functional. But I do not see any way to do it :( I'll end up with a monad, probably.
In related question: does anybody here have experience/benchmarks/tests how/if is PackedString better (uses less memory) than String in parsing tasks?
GHC itself uses a rather low level thing it calls FastString which is basically a pointer into a character array with a length and a unique id. The unique ids are allocated by entering each FastString into a global hash table which also provides sharing if the same string is seen more than once (like your itern feature).
I thought FastString was first incarnation of PackedString, thanks for the hint it could be something more. Does HaXml use any such optimization for XML element name handling?
It is all very low level and ghc-specific however and probably only makes sence in a compiler-like application.
Exactly my setting.
Duncan
-- Gracjan

On Fri, 3 Jun 2005, Gracjan Polak wrote: ...
In related question: does anybody here have experience/benchmarks/tests how/if is PackedString better (uses less memory) than String in parsing tasks?
I don't have that information, but in case it helps, look out for splitPS - it makes the fairly common mistake of discounting a trailing separator, so both A:A: and A:A split to ["A", "A"], where the former should be ["A", "A", ""]. It does handle :A:A and A::A correctly, so you can just wrap splitPS with a function that may add a nilPS depending on the end of the input string. Donn Cave, donn@drizzle.com

Gracjan Polak wrote:
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
I could have as many dictionaries as there are types. The problem is I get one dictionary for each object which defeats the idea.
I believe the (Ord a) constraint acts like a function argument. Therefore iorefset is no CAF, cannot be memoized itself and you get one dictionary per invocation. On the other hand, that is what is to be expected when playing games with unsafePerformIO. You might get it working by giving iorefset a monomorphic type or by specializing it for the type(s) you are using it at. Don't forget the NOINLINE pragma. I wouldn't do it this way, though. If you're parsing, chances are that your code is monadic anyway. Put a StateT over the parser monad and everything works without black magic. Even better, if you're using parsec you can just put the Map in the user state. Udo. -- "Mind if I smoke?" "I don't care if you burst into flames and die!"

Udo Stenzel wrote:
Gracjan Polak wrote:
iorefset :: Ord a => IORef(Map.Map a a) iorefset = unsafePerformIO $ do newIORef $ Map.empty
I could have as many dictionaries as there are types. The problem is I get one dictionary for each object which defeats the idea.
I believe the (Ord a) constraint acts like a function argument. Therefore iorefset is no CAF, cannot be memoized itself and you get one dictionary per invocation. On the other hand, that is what is to be expected when playing games with unsafePerformIO.
Seems you are right. Monomorphic type works, polymorphic doesn't. But it probably is not in any way guaranteed to stay like this in future.
You might get it working by giving iorefset a monomorphic type or by specializing it for the type(s) you are using it at. Don't forget the NOINLINE pragma. I wouldn't do it this way, though. If you're parsing, chances are that your code is monadic anyway. Put a StateT over the parser monad and everything works without black magic. Even better, if you're using parsec you can just put the Map in the user state.
This will create intern-per-parse, which isn't bad and has it's advantages, but I wanted to do something global. Anyway it was interesting experiment :)
Udo.
-- Gracjan

Gracjan Polak wrote:
Hello all,
I've got two questions, both are space related so I'll put them in one e-mail.
1. I'd like to have function intern, that behaves essentially like id, but with the following constraint: if x==y then makeStableName(intern x)==makeStableName(intern y) Reasoning behind this hack: objects equal (as in Eq) should occupy the same place in memory. I've got to parse quite large file, most tokens are the same. Haskell speed is very good, but I constantly run out of memory. Here is something I wrote, but it doesn't work :(
The code below seems to work for strings, and should be generalizable to any type for which you have a hash function: import Data.HashTable as H import System.IO.Unsafe (unsafePerformIO) {-# NOINLINE stringPool #-} stringPool :: HashTable String String stringPool = unsafePerformIO $ new (==) hashString {-# NOINLINE shareString #-} shareString :: String -> String shareString s = unsafePerformIO $ do mv <- H.lookup stringPool s case mv of Just s' -> return s' Nothing -> do H.insert stringPool s s return s It seems very similiar to your code, except that it uses HashTable instead of Map. /Björn

Bjorn Bringert wrote:
memory. Here is something I wrote, but it doesn't work :(
I must have been doing something really wrong that day, because today it works smoothly... :)
The code below seems to work for strings, and should be generalizable to any type for which you have a hash function:
import Data.HashTable as H import System.IO.Unsafe (unsafePerformIO)
{-# NOINLINE stringPool #-} stringPool :: HashTable String String stringPool = unsafePerformIO $ new (==) hashString
{-# NOINLINE shareString #-} shareString :: String -> String shareString s = unsafePerformIO $ do mv <- H.lookup stringPool s case mv of Just s' -> return s' Nothing -> do H.insert stringPool s s return s
Very interesting, thanks!
It seems very similiar to your code, except that it uses HashTable instead of Map.
Question is: which one is better? My tupicall file contains 160000 tokens, where 95% is taken by about 20 tokens that are used very frequently.
/Björn
-- Gracjan
participants (8)
-
Bjorn Bringert
-
Donn Cave
-
Duncan Coutts
-
Gracjan Polak
-
John Meacham
-
Marcin 'Qrczak' Kowalczyk
-
Scott Turner
-
Udo Stenzel