
On Mon, 2005-05-23 at 00:42 +0100, Duncan Coutts wrote:
On Sun, 2005-05-22 at 17:23 +1000, Manuel M T Chakravarty wrote:
This just needs a lot of space.
This is true; it does just have to keep track of a great deal of information.
Still, I wonder if there is something going on that we don't quite understand. The serialised dataset for c2hs when processing the Gtk 2.6 headers is 9.7Mb (this figure does include string sharing, but that sharing should mostly be present in the heap too, and even if it isn't, it only accounts for about a 2x space blowup). I know that when represented in the GHC heap it will take more space than this, because of all the pointers (and finite maps rather than simple lists), but that factor wouldn't account for the actual minimum heap requirements, which are about 30 times bigger than the serialised format (roughly 290Mb).
Actually, that could be verified experimentally by deserialising the dataset and making sure it is all in memory using deepSeq (this would be necessary since we deserialise the dataset lazily).
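For concreteness, the forcing step would look roughly like this (a minimal sketch, using Control.DeepSeq from the deepseq package so the example is self-contained, rather than whatever deepSeq c2hs itself defines; Dataset and loadDataset are just stand-ins for the real c2hs types and deserialiser):

  import Control.DeepSeq (NFData (..), deepseq)
  import Control.Exception (evaluate)

  -- Placeholder for the real deserialised c2hs structure.
  data Dataset = Dataset

  instance NFData Dataset where
    rnf Dataset = ()

  -- Stand-in for the real lazy deserialiser.
  loadDataset :: FilePath -> IO Dataset
  loadDataset _ = return Dataset

  main :: IO ()
  main = do
    ds <- loadDataset "gtk-headers.dat"
    -- Forcing with deepseq walks the whole structure, so nothing is left
    -- as an unevaluated thunk; the live heap afterwards (e.g. via +RTS -s)
    -- reflects the full in-memory size of the dataset.
    _ <- evaluate (ds `deepseq` ())
    putStrLn "dataset fully forced"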
From my brief experiment, the 9.7Mb file, when deserialised into the heap, takes just over 50Mb of heap space, and top reported a 47Mb RSS.
I tried another experiment and found that the parsing phase by itself required over 250Mb of heap space; by the time it got to the name analysis it required over 350Mb. So from that it looks to me that the parser could be improved.

The lexer/parser could be swapped out for another implementation without affecting any other module. Perhaps we should look at one based on Alex & Happy. Happy can do monadic parsers, which would allow the parser to maintain the set of identifiers needed when parsing C (a rough sketch of that state is below). Alex & Happy can produce pure Haskell98 code (or GHC-specific code for better performance), so the portability of c2hs would not be affected, unlike our binary serialisation patches, which use various GHC'isms.

Duncan
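P.S. To give a flavour of what that monadic parser state could look like: a C parser needs to know which identifiers have been introduced by typedefs, so that it can tell type names apart from ordinary identifiers. This is only a sketch; the Token type and function names are illustrative, not the actual Happy, Alex or c2hs API.

  import qualified Data.Set as Set
  import Control.Monad.State (State, gets, modify)

  -- The set of identifiers introduced by typedef declarations so far.
  type TypedefSet = Set.Set String

  -- The monad a Happy-generated parser (and the Alex lexer) could run in.
  type P a = State TypedefSet a

  data Token = TIdent String | TTypeName String
    deriving Show

  -- Record a name introduced by a typedef declaration.
  addTypedef :: String -> P ()
  addTypedef = modify . Set.insert

  -- Classify an identifier: it lexes as a type name if it was typedef'd earlier.
  classifyIdent :: String -> P Token
  classifyIdent name = do
    isType <- gets (Set.member name)
    return (if isType then TTypeName name else TIdent name)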