
Hi! Do you plan to add support in c2hs for compiling with ghc-6.4? Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

On 30/03/2005, at 12:53 AM, Gour wrote:
Hi!
Do you plan to add support in c2hs for compiling with ghc-6.4?
Hi Gour, I submitted a patch yesterday that should make C2HS build with GHC 6.4 OK now; let me know if it works for you or not. I don't think Manuel's announced it yet, but C2HS is now being stored in a Darcs repository rather than CVS. darcs get it from here: http://www.cse.unsw.edu.au/~chak/repos/c2hs/ -- % Andre Pang : trust.in.love.to.save http://www.algorithm.com.au/

Andre Pang (ozone@algorithm.com.au) wrote: Hi Andre!
Hi Gour, I submitted a patch yesterday that should make C2HS build with GHC 6.4 OK now; let me know if it works for you or not.
Yes, it compiles nicely!
I don't think Manuel's announced it yet, but C2HS is now being stored in a Darcs repository rather than CVS. darcs get it from here:
Great news to see c2hs in a Darcs repo :-)) Btw, gtk2hs devs have a problem with space leaks in c2hs, i.e. one requires over 1GB of RAM to process the gtk2 headers. Do you have some idea how to make c2hs the definitive Haskell FFI tool? Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

On 19/05/2005, at 12:15 AM, Gour wrote:
btw, gtk2hs devs have a problem with space leaks in c2hs, i.e. one requires over 1GB of RAM to process the gtk2 headers.
I think Duncan Coutts was working on this; I don't know any of the details, so we'll have to wait for his reply ... -- % Andre Pang : trust.in.love.to.save http://www.algorithm.com.au/

André Pang (ozone@algorithm.com.au) wrote:
I think Duncan Coutts was working on this; I don't know any of the details, so we'll have to wait for his reply ...
I know he was working on it, but there are still problems in finding an appropriate workaround for building and distributing the gtk2hs bindings, so I thought that more eyes can see better :-) Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

On Thu, 2005-05-19 at 16:26 +1000, André Pang wrote:
On 19/05/2005, at 12:15 AM, Gour wrote:
btw, gtk2hs devs have a problem with space leaks in c2hs, i.e. one requires over 1GB of RAM to process the gtk2 headers.
I think Duncan Coutts was working on this; I don't know any of the details, so we'll have to wait for his reply ...
Well...

Our considered opinion (Axel and myself) is that c2hs's memory consumption is a very difficult thing to fix, and any fix we might be able to come up with would likely be very invasive, so Manuel would not be very keen on the idea.

One approach I haven't tried but which might bear some fruit is to check that c2hs is actually using all the data it collects, or if in fact much of the AST goes unused, in which case it could be eliminated. However, I don't imagine that this would give any enormous savings (i.e. enough to process the Gtk 2.x headers on a machine with 256Mb of RAM).

I do have another idea, however, which I would like to get some feedback upon...

Basically the idea is that we want to only run c2hs on the developers' machines and distribute the resulting .hs files. That way only the developers' machines need 1Gb of RAM. But we also want the resulting .hs files to be portable, both between architectures and between different versions of Gtk+.

For the architecture independence, all we have to do is make sure we are not using the c2hs {# get #} and {# set #} features, since they embed field offsets into the .hs file, which is not portable. This is not a great hardship for us since we mostly use hsc2hs for doing structure access and c2hs for calling functions.

Having the .hs files work with different Gtk+ versions is more tricky. The idea here is to run cpp on the .hs files after running c2hs: we distribute the .hs files output from c2hs and run cpp over them on the target machine, when we know what version of Gtk+ we are targeting. For this to work we need to run c2hs with the latest version of Gtk+ (since it has to be a superset of all versions we intend the .hs files to support), and we need to have c2hs pass the preprocessor directives through to the .hs files.

In fact it is slightly more complicated than this. We can't just have c2hs ignore the cpp directives, since then the problem would be that the foreign import declarations that c2hs adds at the end of the .hs file would not be in the context of the cpp directives where the {# call #} was used. So to fix this problem I have hacked up a patch such that the cpp context is output along with the foreign import declarations. This takes advantage of the existing feature in c2hs where it interprets the cpp directives, but uses it for a different purpose. I don't think this approach is too invasive; it is certainly much less so than our existing precompiled headers patch or any proposed heap reduction strategy.

I do not yet know if this approach will work fully; I'm still assessing its feasibility. If it does turn out to be a workable approach, I'd be keen to discuss with Manuel whether he might accept such a feature (controlled by some command line flag) into the main c2hs.

Duncan

Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
Our considered opinion (Axel and myself) is that c2hs's memory consumption is a very difficult thing to fix and any fix we might be able to come up with would likely be very invasive and so Manuel would not be very keen on the idea.
Please excuse the dumb question (I'm not familiar with c2hs' internals): is c2hs' memory consumption a result of its design, or a consequence of unfixed space leaks, i.e. is it fixable without rewriting?
One approach I haven't tried but which might bear some fruit is to check that c2hs is actually using all the data it collects, or if in fact much of the AST goes unused, in which case it could be eliminated. However, I don't imagine that this would give any enormous savings (i.e. enough to process the Gtk 2.x headers on a machine with 256Mb of RAM).
So, let's not dwell on it.
Basically the idea is that we want to only run c2hs on the developers' machines and distribute the resulting .hs files. That way only the developers' machines need 1Gb of RAM.
But don't we cut down with this the (potential) number of devs who can hack on the repository code (maybe not relevant now, but hopefully it will become)? And what is the real amount of RAM needed? I could not compile gtk2hs with 1GB of RAM, but I'm not sure whether that is because of problems with ghc on amd64 or due to c2hs' memory consumption.
But we also want the resulting .hs files to be portable. Portable both between architectures and between different versions of Gtk+. For the architecture independence all we have to do is make sure we are not using the c2hs {# get #} {# set #} features since they embed field offsets into the .hs file which is not portable. This is not a great hardship for us since we mostly use hsc2hs for doing structure access and c2hs for calling functions.
Are there some other spots where using hsc2hs (instead of depending on c2hs) could come in handy?
I do not yet know if this approach will work fully; I'm still assessing its feasibility. If it does turn out to be a workable approach, I'd be keen to discuss with Manuel whether he might accept such a feature (controlled by some command line flag) into the main c2hs.
Although I like the concept of c2hs very much, I still consider that, in case we cannot get gtk2hs to build on 'normal' machines with the 'normal' invocation of c2hs - by fixing the present c2hs memory consumption - then we have to be pragmatic and use and/or tailor the available tools so that the job can be done, without worrying about compatibility with the 'official' c2hs development. In the end, c2hs should be a tool helping us develop the gtk2hs bindings, and not vice versa. However, I still hope that (somehow) c2hs can be brought down to 'normal' memory requirements... and (hopefully) this discussion can shed some more light. Manuel, as the author of c2hs, what is your piece of advice? Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

Am Donnerstag, den 19.05.2005, 14:32 +0100 schrieb Duncan Coutts:
On Thu, 2005-05-19 at 16:26 +1000, André Pang wrote:
On 19/05/2005, at 12:15 AM, Gour wrote:
btw, gtk2hs devs have a problem with space leaks in c2hs, i.e. one requires over 1GB of RAM to process the gtk2 headers.
First of all, I am not convinced that we are having a space leak in c2hs. Let's look at what c2hs does. It runs cpp over a header, which for GTK+ gives one enormous file with C declarations. c2hs needs to read the whole thing, as due to the nature of C, it is impossible to judge a priori which declarations are relevant for the binding at hand. This just needs a lot of space. It is probably possible to come up with a more efficient representation of the AST, but that would probably be quite some work to implement.
Our considered opinion (Axel and myself) is that c2hs's memory consumption is a very difficult thing to fix and any fix we might be able to come up with would likely be very invasive and so Manuel would not be very keen on the idea.
I don't mind it being invasive if (1) it is not gtk2hs specific (ie, it must be generally useful) and (2) it doesn't conflict with other features and/or the basic structure.
One approach I haven't tried but which might bear some fruit is to check that c2hs is actually using all the data it collects, or if in fact much of the AST goes unused, in which case it could be eliminated. However, I don't imagine that this would give any enormous savings (i.e. enough to process the Gtk 2.x headers on a machine with 256Mb of RAM).
I am sure lots of the AST isn't used, but we won't know until after most of the work is done. To do its work, c2hs needs all declarations on which any symbols bound from Haskell directly or indirectly depend.
I do have another idea however which I would like to get some feedback upon...
Basically the idea is that we want to only run c2hs on the developers' machines and distribute the resulting .hs files. That way only the developers' machines need 1Gb of RAM. But we also want the resulting .hs files to be portable, both between architectures and between different versions of Gtk+. For the architecture independence, all we have to do is make sure we are not using the c2hs {# get #} and {# set #} features, since they embed field offsets into the .hs file, which is not portable. This is not a great hardship for us since we mostly use hsc2hs for doing structure access and c2hs for calling functions.
Having the .hs files work with different Gtk+ versions is more tricky. The idea here is to run cpp on the .hs files after running c2hs, so we distribute the .hs files output from c2hs and run cpp over them on the target machine when we know what version of Gtk+ we are targeting. For this to work we need to run c2hs with the latest version of Gtk+ (since it has to be a superset of all versions we intend the .hs files to support) and we need to have c2hs pass the preprocessor directives through to the .hs files.
In fact it is slightly more complicated than this. We can't just have c2hs ignore the cpp directives, since then the problem would be that the foreign import declarations that c2hs adds at the end of the .hs file would not be in the context of the cpp directives where the {# call #} was used. So to fix this problem I have hacked up a patch such that the cpp context is output along with the foreign import declarations. This takes advantage of the existing feature in c2hs where it interprets the cpp directives, but uses it for a different purpose. I don't think this approach is too invasive; it is certainly much less so than our existing precompiled headers patch or any proposed heap reduction strategy.
I do not yet know if this approach will work fully; I'm still assessing its feasibility. If it does turn out to be a workable approach, I'd be keen to discuss with Manuel whether he might accept such a feature (controlled by some command line flag) into the main c2hs.
I don't know what you mean by the cpp context. Moreover, I would like a clear story on what cpp directives are passed through and what are interpreted. What I don't like about this approach is that it is to an extent gtk2hs specific. Let me explain. Not needing c2hs on user machines would be a Good Thing. Supporting this only for the subset of features used by gtk2hs (ie, no set and get hooks) is bad. Manuel

On Sun, 2005-05-22 at 17:23 +1000, Manuel M T Chakravarty wrote:
Am Donnerstag, den 19.05.2005, 14:32 +0100 schrieb Duncan Coutts:
On Thu, 2005-05-19 at 16:26 +1000, André Pang wrote:
On 19/05/2005, at 12:15 AM, Gour wrote:
btw, gtk2hs devs have a problem with space leaks in c2hs, i.e. one requires over 1GB of RAM to process the gtk2 headers.
First of all, I am not convinced that we are having a space leak in c2hs.
Yes I think that's right. I did quite a bit of profiling work last year and I didn't notice anything that looked to me like a space leak.
Let's look at what c2hs does. It runs cpp over a header, which for GTK+ gives one enormous file with C declarations. c2hs needs to read the whole thing, as due to the nature of C, it is impossible to judge a priori which declarations are relevant for the binding at hand.
This just needs a lot of space.
This is true, it does just have to keep track of a great deal of information. Still, I wonder if there is something going on that we don't quite understand. The serialised dataset for c2hs when processing the Gtk 2.6 headers is 9.7Mb (this figure does include string sharing, but this should mostly be happening in the heap too, and even if it isn't, it's only a 2x space blowup). I know that when represented in the ghc heap it will take more space than this because of all the pointers (and finite maps rather than simple lists), but that factor wouldn't account for the actual minimum heap requirements, which are about 30 times bigger than the serialised format. Actually, that could be verified experimentally by unserialising the dataset and making sure it is all in memory by using deepSeq (this would be necessary since we lazily deserialise the dataset).
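[For illustration, a minimal sketch of that experiment. The Dataset type, the deserialise function and the "gtk.precomp" file name are made-up stand-ins for c2hs's real precomp machinery, and the modern Control.DeepSeq module stands in for whatever deepSeq implementation was in use at the time; this is not the actual c2hs code.]

    module Main where

    import Control.DeepSeq (NFData (..), deepseq)
    import qualified Data.ByteString.Lazy.Char8 as BL

    -- Hypothetical stand-ins for c2hs's real dataset type and deserialiser.
    data Dataset = Dataset [String]

    instance NFData Dataset where
      rnf (Dataset xs) = rnf xs

    deserialise :: BL.ByteString -> Dataset
    deserialise = Dataset . lines . BL.unpack

    main :: IO ()
    main = do
      bytes <- BL.readFile "gtk.precomp"   -- hypothetical precomp file name
      let ds = deserialise bytes
      -- Force the whole structure into the heap, then report, so the heap
      -- residency can be read off +RTS -s (or watched in top).
      ds `deepseq` putStrLn "dataset fully evaluated; check heap residency now"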
It is probably possible to come up with a more efficient representation of the AST, but that would probably be quite some work to implement.
Our considered opinion (Axel and myself) is that c2hs's memory consumption is a very difficult thing to fix and any fix we might be able to come up with would likely be very invasive and so Manuel would not be very keen on the idea.
I don't mind it being invasive if
(1) it is not gtk2hs specific (ie, it must be generally useful) and (2) it doesn't conflict with other features and/or the basic structure.
Both reasonable. We'll keep that in mind if we try for a heap reduction patch. I think if I were to try this again, I'd try and make the name analysis phase into an external algorithm, keeping as much of the various finite maps in external files for most of the time.
One approach I haven't tried but which might bear some fruit is to check that c2hs is actually using all the data it collects, or if in fact much of the AST goes unused, in which case it could be eliminated. However, I don't imagine that this would give any enormous savings (i.e. enough to process the Gtk 2.x headers on a machine with 256Mb of RAM).
I am sure lots of the AST isn't used, but we won't know until after most of the work is done. To do its work, c2hs needs all declarations on which any symbols bound from Haskell directly or indirectly depend.
I know there will be lots of symbols that each particular .chs file will not use. I meant bits that wouldn't possibly ever be used for any possible .chs file. But that's also why I said it probably wouldn't be much of a saving.
I do have another idea however which I would like to get some feedback upon...
Basically the idea is that we want to only run c2hs on the developers' machines and distribute the resulting .hs files.
(snip)
I do not yet know if this approach will work fully; I'm still assessing its feasibility. If it does turn out to be a workable approach, I'd be keen to discuss with Manuel whether he might accept such a feature (controlled by some command line flag) into the main c2hs.
I don't know what you mean by the cpp context. Moreover, I would like a clear story on what cpp directives are passed through and what are interpreted.
It's easiest to explain with an example. We have some bit of code in a .chs file that is compiled conditionally based on some cpp test:

    #ifdef USE_GCLOSUE_SIGNALS_IMPL

    connectGeneric :: GObjectClass obj => ...
      ... snip ...
      {# call g_signal_connect_closure #}

    #else
      ... etc ...

and with my patched c2hs it outputs this FFI import at the end of the .hs file:

    #ifdef USE_GCLOSUE_SIGNALS_IMPL

    foreign import ccall safe " g_signal_connect_closure"
      g_signal_connect_closure :: ((Ptr ()) -> ((Ptr CChar) -> ((Ptr GClosure) -> (CInt -> (IO CULong)))))

    #endif

So what it does is use the existing code that collects the cpp directives, but then instead of doing the business of building a .h file from them and discarding the cpp directives from the list of fragments to be output, it keeps them in the output .hs file. Then, when going over the .chs fragments expanding all the hooks, it maintains a stack of the cpp directives (i.e. push when we encounter an #if and pop when we see an #endif), so when we get to a call hook we need to expand we know the "cpp context". We pass this "cpp context" down to the code that generates the deferred code for the foreign import ccall declarations and use it to reconstruct the cpp conditional directives around the FFI import declaration.
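[A rough Haskell sketch of that bookkeeping, just to illustrate the idea. The Fragment type and the function names below are made up, this is not c2hs's actual code, and #else/#elif handling is ignored for brevity.]

    module CppContext where   -- hypothetical module name

    type CppContext = [String]          -- enclosing conditionals, outermost first

    data Fragment = CppIf String        -- an "#if..." / "#ifdef..." line, verbatim
                  | CppEndif            -- an "#endif" line
                  | CallHook String     -- a {# call ... #} hook naming a C function
                  | OtherText String    -- anything else, passed through untouched

    -- Walk the fragments, keeping a stack of the currently open conditionals
    -- (innermost first), and pair every call hook with the context it sits in.
    collectContexts :: [Fragment] -> [(String, CppContext)]
    collectContexts = go []
      where
        go _     []                     = []
        go stack (CppIf d     : rest)   = go (d : stack) rest
        go stack (CppEndif    : rest)   = go (drop 1 stack) rest
        go stack (CallHook fn : rest)   = (fn, reverse stack) : go stack rest
        go stack (OtherText _ : rest)   = go stack rest

    -- Re-emit a deferred foreign import wrapped in its recorded conditionals.
    emitImport :: (String, CppContext) -> String
    emitImport (fn, ctx) = unlines $
         ctx
      ++ ["foreign import ccall safe \"" ++ fn ++ "\" ..."]
      ++ replicate (length ctx) "#endif"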
What I don't like about this approach is that it is to an extent gtk2hs specific. Let me explain. Not needing c2hs on user machines would be a Good Thing. Supporting this only for the subset of features used by gtk2hs (ie, no set and get hooks) is bad.
Yes that is not great for a feature to go into mainline c2hs. Perhaps this 'do cpp after chs' mode should output .hsc files to be further processed by hsc2hs. That way it could output #offset & #size macros where c2hs would normally output numeric constants. Duncan
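[Purely to illustrate the shape of such .hsc output, a hedged sketch of what a generated fragment could look like, assuming GTK+ 2.x headers; the module name and functions are made up and this is not actual gtk2hs or c2hs output.]

    -- WidgetAlloc.hsc  (hypothetical file name, processed by hsc2hs)
    module WidgetAlloc where

    #include <gtk/gtk.h>

    import Foreign.Ptr      (Ptr)
    import Foreign.C.Types  (CInt)
    import Foreign.Storable (peekByteOff)

    data GtkAllocation            -- opaque tag for the C struct

    -- The struct size and field offset stay as hsc2hs macros, expanded
    -- against whatever GTK+ headers are installed on the target machine,
    -- rather than being baked in as numeric constants.
    allocationSize :: Int
    allocationSize = (#size GtkAllocation)

    peekAllocationX :: Ptr GtkAllocation -> IO CInt
    peekAllocationX p = peekByteOff p (#offset GtkAllocation, x)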

Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
Yes I think that's right. I did quite a bit of profiling work last year and I didn't notice anything that looked to me like a space leak.
Huh... I'm a little bit puzzled now, 'cause on the dev list the discussion was going on about 'space leaks'. Anyway, the fact remains the same for the end user: it's not possible to build gtk2hs without having an enormous amount of RAM.
Both reasonable. We'll keep that in mind if we try for a heap reduction patch.
Am I right in assuming that this could be even more difficult than fixing a (non-existent) space leak?
I think if I were to try this again, I'd try and make the name analysis phase into an external algorithm, keeping as much of the various finite maps in external files for most of the time.
Let me know if I can help with some sort of testing. Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

On Mon, 2005-05-23 at 09:56 +0200, Gour wrote:
Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
Yes I think that's right. I did quite a bit of profiling work last year and I didn't notice anything that looked to me like a space leak.
Huh... I'm a little bit puzzled now, 'cause on the dev list the discussion was going on about 'space leaks'.
Well it wasn't me :-). The typical cause of vast memory consumption is a space leak, so it's often something people mention.
Anyway, the fact remains the same for the end user: it's not possible to build gtk2hs without having an enormous amount of RAM.
Both reasonable. We'll keep that in mind if we try for a heap reduction patch.
Am I right in assuming that this could be even more difficult than fixing a (non-existent) space leak?
Hard to say. :-) It means changing algorithms and/or data structures rather than trying to identify and understand space leaks.
I think if I were to try this again, I'd try and make the name analysis phase into an external algorithm, keeping as much of the various finite maps in external files for most of the time.
Let me know if I can help with some sort of testing.
Ok. Duncan

Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
Well it wasn't me :-).
:-)
It means changing algorithms and/or data structures rather than trying to identify and understand space leaks.
True. That's why I think of it that way, i.e. as a question of design. Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

On Mon, 2005-05-23 at 00:42 +0100, Duncan Coutts wrote:
On Sun, 2005-05-22 at 17:23 +1000, Manuel M T Chakravarty wrote:
This just needs a lot of space.
This is true, it does just have to keep track of a great deal of information.
Still, I wonder if there is something going on that we don't quite understand. The serialised dataset for c2hs when processing the Gtk 2.6 headers is 9.7Mb (this figure does include string sharing, but this should mostly be happening in the heap too, and even if it isn't, it's only a 2x space blowup). I know that when represented in the ghc heap it will take more space than this because of all the pointers (and finite maps rather than simple lists), but that factor wouldn't account for the actual minimum heap requirements, which are about 30 times bigger than the serialised format.
Actually, that could be verified experimentally by unserialising the dataset and making sure it is all in memory by using deepSeq (this would be necessary since we lazily deserialise the dataset).
From my brief experiment the 9.7 Mb file when deserialised into the heap takes just over 50Mb of heap space and top reported 47Mb RSS.
I tried another experiment and found that the parsing phase by itself required over 250Mb of heap space. By the time it got to the name analysis it required over 350Mb. So from that it looks to me that the parser could be improved. The lexer/parser could be swapped out for another implementation without affecting any other module.

Perhaps we should look at one based on Alex & Happy. Happy can do monadic parsers, which would allow it to maintain the set of identifiers needed when parsing C. Alex & Happy can produce pure Haskell98 code (or ghc-specific code for better performance), so the portability of c2hs would not be affected - unlike our binary serialisation patches, which use various ghc-isms.

Duncan

Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
Perhaps we should look at one based on Alex & Happy. Happy can do monadic parsers, which would allow it to maintain the set of identifiers needed when parsing C. Alex & Happy can produce pure Haskell98 code (or ghc-specific code for better performance), so the portability of c2hs would not be affected - unlike our binary serialisation patches, which use various ghc-isms.
Go Duncan, go ;) Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

Duncan Coutts:
On Mon, 2005-05-23 at 00:42 +0100, Duncan Coutts wrote: From my brief experiment the 9.7 Mb file when deserialised into the heap takes just over 50Mb of heap space and top reported 47Mb RSS.
I tried another experiment and found that the parsing phase by itself required over 250Mb of heap space. By the time it got to the name analysis it required over 350Mb.
So from that it looks to me that the parser could be improved. The lexer/parser could be swapped out for another implementation without affecting any other module.
Perhaps we should look at one based on Alex & Happy. Happy can do monadic parsers, which would allow it to maintain the set of identifiers needed when parsing C. Alex & Happy can produce pure Haskell98 code (or ghc-specific code for better performance), so the portability of c2hs would not be affected - unlike our binary serialisation patches, which use various ghc-isms.
An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as does the existing lexer and parser (with comments stating to which sections in K&R the individual productions relate). Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides a monad suitable for Happy. Manuel

On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as does the existing lexer and parser (with comments stating to which sections in K&R the individual productions relate).
Interim status update: I've started with the lexer, using Alex. I've converted, clause for clause, the existing lexer, keeping the K&R language definition comments. The output is exactly the same as for the existing lexer on my test file gtk.i (cpp output of gtk/gtk.h), which is 1014K.

It's a tad faster and runs in minimal space. I timed how long it takes to find the length of the list of tokens (so there's no IO overhead). I used ghc -O for the lexer modules and all relevant dependent modules. On my old slow 500MHz sparc, the existing lexer takes 7.8 seconds and needs 14Mb heap minimum, while the Alex lexer takes 1.8 seconds and runs in less than 1Mb of heap.

One issue I noticed with the existing lexer is that it seems to be strict, i.e. the whole token stream is built before it is returned. The Alex lexer is lazy, which would explain the difference in heap usage. This suggests the existing lexer could be made to perform better if it were made lazy.
Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides a monad suitable for Happy.
I'll take a look. As for the lexer/parser interaction required for C, I guess the way to do this is to make the lexer monad keep the set of 'typedefed' identifiers and return the appropriate token type to the parser depending on membership of that set. The lexer monad should also expose an operation to the parser to add identifiers to the 'typedefed' set, which the parser would use after parsing the appropriate kind of declaration. Duncan
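[A minimal Haskell sketch of that scheme, using a plain State monad over a Data.Set for illustration. The token type and function names are made up; this is not c2hs's actual ParserMonad.]

    module TypedefLexer where   -- hypothetical module name

    import Control.Monad.State
    import qualified Data.Set as Set

    -- Made-up token type; the real lexer has a much richer one.
    data CToken = TIdent String | TTypeIdent String
      deriving Show

    type TypedefSet = Set.Set String
    type P = State TypedefSet          -- monad shared by lexer and parser

    -- Used by the lexer for every identifier: hand the parser a type-name
    -- token if the identifier has been typedef'ed earlier.
    classifyIdent :: String -> P CToken
    classifyIdent name = do
      tds <- get
      return $ if name `Set.member` tds
                 then TTypeIdent name
                 else TIdent name

    -- Exposed to the parser: record a newly typedef'ed name once the
    -- enclosing declaration has been reduced (i.e. after its ';').
    addTypedef :: String -> P ()
    addTypedef = modify . Set.insert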

Duncan Coutts:
On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as does the existing lexer and parser (with comments stating to which sections in K&R the individual productions relate).
Interim status update:
I've started with the lexer, using alex. I've converted clause for clause the existing lexer, keeping the K&R language definition comments.
The output is exactly the same as for the existing lexer on my test file of gtk.i (cpp output of gtk/gtk.h) which is 1014K.
It's a tad faster and runs in minimal space. I timed how long it takes to find the length of the list of tokens (so there's no IO overhead). I used ghc -O for the lexer modules and all relevant dependent modules. On my old slow 500Mhz sparc, the existing lexer takes 7.8 seconds and needs 14Mb heap minimum while the alex lexer takes 1.8 seconds and runs in less than 1Mb heap.
Cool.
One issue I noticed with the existing lexer is that it seems to be strict, ie the whole token stream is built before it is returned. The alex lexer is lazy which would explain the difference in heap usage. This suggests the existing lexer could be made to perform better if it were made lazy.
When I originally wrote the lexer, I benchmarked a number of variants. The result was that the strict one was more efficient, but that was long ago with an ancient version of ghc.
Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides a monad suitable for Happy.
I'll take a look.
As for the lexer/parser interaction required for C, I guess the way to do this is to make the lexer monad keep the set of 'typedefed' identifiers and return the appropriate token type to the parser depending on membership of that set. The lexer monad should also expose an operation to the parser to add identifiers to the 'typedefed' set, which the parser would use after parsing the appropriate kind of declaration.
Sounds right to me. Cheers, Manuel

On Thu, 2005-05-26 at 16:46 +1000, Manuel M T Chakravarty wrote:
Duncan Coutts:
On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as does the existing lexer and parser (with comments stating to which sections in K&R the individual productions relate).
Interim status update:
Another status update:

I started with a yacc C grammar and added the semantic actions by translating from the existing parser. I reordered the clauses to match the order in the original parser and kept the comments (but removed the bits that are no longer true). I think I've added back in all the syntactic extensions supported by c2hs. There is one shift/reduce conflict, to do with the "if then else" syntax, which I believe is benign.

It's not quite finished yet (in particular I've not done allocation of attributes), but it does already parse the Gtk+ headers. It is quite quick and, importantly, uses very little heap space. On my old 500MHz sparc, parsing the Gtk+ headers takes 8 seconds and requires 28Mb of heap. There appears to be essentially no allocation apart from that used by the AST, since the parser does not get slower when the heap size is very slightly larger than the minimum required. Depending on the heap limit, it either runs at full speed or runs out of heap.

Going back to the lexer, it now produces exactly the same output as the original lexer (including positions and unique names). Sadly it seems to have got quite a bit slower, for reasons I don't quite understand. In particular, making it monadic (which we need to do) seems to make it rather slower. It is now taking 6 seconds rather than 2, and so is now only a little faster than the original lexer. Though on the positive side it means that if the lexer is taking 6 out of the 8 second total, then the parser is only taking 2 seconds, which is quite good.

One important speedup I got was to change the identifier vs. reserved word lookup so that it does not use a linear search over the list of reserved word strings. That cut overall parsing time down from 10 seconds to 8. I've slightly generalised the grammar for GNU C __attribute__s after looking at the GCC manual.

So the remaining things to be done before posting the stuff for review are to get the attribute allocation going and to check that the AST produced by this parser is exactly the same as for the existing parser. Then integrate this parser into c2hs and see what the overall memory requirements turn out to be, taking into account the name analysis phase.
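[For illustration, a minimal sketch of the kind of reserved-word lookup change described above - replacing a linear scan over the keyword list with a Data.Map built once. The module and token type are made up; this is not the actual c2hs code.]

    module ReservedWords where   -- hypothetical module name

    import qualified Data.Map as Map

    -- Made-up token type for the sketch.
    data CToken = TKeyword String | TIdent String
      deriving Show

    -- Built once; each lookup is then O(log n) instead of a linear scan.
    reservedWords :: Map.Map String CToken
    reservedWords = Map.fromList
      [ (kw, TKeyword kw)
      | kw <- [ "auto", "break", "case", "char", "const", "continue"
              , "default", "do", "double", "else", "enum", "extern"
              , "float", "for", "goto", "if", "int", "long", "register"
              , "return", "short", "signed", "sizeof", "static", "struct"
              , "switch", "typedef", "union", "unsigned", "void"
              , "volatile", "while" ] ]

    -- Called for every identifier-shaped lexeme the lexer produces.
    classify :: String -> CToken
    classify s = Map.findWithDefault (TIdent s) s reservedWords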
Moreover, the module c2hs/base/syntax/ParserMonad.hs already provides a monad suitable for Happy.
I'll take a look.
Unfortunately, since the monad used by the parser has to be the same as the one used by the lexer, the existing parser monad is not suitable. I've based another one on the happy monad example and the ghc lexer/parser monad.
As for the lexer/parser interaction required for C, I guess the way to do this is to make the lexer monad keep the set of 'typedefed' identifiers and return the appropriate token type to the parser depending on membership of that set. The lexer monad should also expose an operation to the parser to add identifiers to the 'typedefed' set, which the parser would use after parsing the appropriate kind of declaration.
Sounds right to me.
I got this scheme working after encountering one interesting problem along the way... I had to refactor the C grammar slightly to move the reduction rule for typedefs (which adds the typedef'ed names to the lexer/parser monad state) below the ';' terminal in declarations, so that constructs like the following work:

    typedef int foo;
    foo bar (int baz);

Otherwise the 'foo' token on the second line is not recognised as a type name, since the reduction rule for the previous declaration is not done until the 'foo' token has already been seen. With the slightly refactored grammar/semantic actions, the reduction rule is invoked as soon as the ';' token is encountered, and so it all works.

(With thanks to Heffalump and skew on #haskell for help on this issue)

Duncan

On Mon, 2005-05-30 at 19:18 +0100, Duncan Coutts wrote:
Sadly it seems to have got quite a bit slower, for reasons I don't quite understand. In particular, making it monadic (which we need to do) seems to make it rather slower. It is now taking 6 seconds rather than 2, and so is now only a little faster than the original lexer. Though on the positive side it means that if the lexer is taking 6 out of the 8 second total, then the parser is only taking 2 seconds, which is quite good.
Oops, slightly wrong about that. I was timing the unoptimised (ie no -O) version. The lexer actually takes 3 seconds on my old sparc (and 0.3 on my nice new athlon). So that's not so much of an increase from the 2 seconds I had initially. So that means the parser is taking 5 out of the 8 seconds total which makes more sense. On my fast athlon the parsing time is 1.7 seconds which is really pretty quick! It might even be quick enough that we don't need the precomp feature in future, but we'll see how long it takes when we add in the name analysis too. Duncan

Duncan Coutts (duncan.coutts@worc.ox.ac.uk) wrote:
On my fast athlon the parsing time is 1.7 seconds which is really pretty quick! It might even be quick enough that we don't need the precomp feature in future, but we'll see how long it takes when we add in the name analysis too.
Congratulations, Duncan! I always believed you could do it and improve c2hs so one can use it with decent memory requirements. Sincerely, Gour -- Registered Linux User | #278493 GPG Public Key | 8C44EDCD

Am Montag, den 30.05.2005, 19:18 +0100 schrieb Duncan Coutts:
On Thu, 2005-05-26 at 16:46 +1000, Manuel M T Chakravarty wrote:
Duncan Coutts:
On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
An Alex/Happy parser would be an option if it improves matters significantly. If you or anybody else has a go at it, please follow the C language definition closely, as does the existing lexer and parser (with comments stating to which sections in K&R the individual productions relate).
Interim status update:
Another status update:
I started with a yacc C grammar and added the semantic actions by translating from the existing parser. I reordered the clauses to match the order in the original parser and kept the comments (but removed the bits that are no longer true). I think I've added back in all the syntactic extensions supported by c2hs. There is one shift/reduce conflict to do with the "if then else" syntax which I believe is benign.
It's not quite finished yet (in particular I've not done allocation of attributes), but it does already parse the Gtk+ headers. It is quite quick and importantly uses very little heap space. [..]
Sounds all very promising. Thanks for the effort! Manuel

On Mon, 2005-05-30 at 19:18 +0100, Duncan Coutts wrote: [..]
Going back to the lexer, it now produces exactly the same output as the original lexer (including positions and unique names). Sadly it seems to have got quite a bit slower, for reasons I don't quite understand. In particular, making it monadic (which we need to do) seems to make it rather slower. It is now taking 6 seconds rather than 2, and so is now only a little faster than the original lexer. Though on the positive side it means that if the lexer is taking 6 out of the 8 second total, then the parser is only taking 2 seconds, which is quite good.
Ok, I'm impressed, too. But was the parser the culprit? It did use a lot of space, but then most of the time in our current setup is spent in serialisation. So if I understand your intention, you are mainly trying to improve the memory footprint, not the compilation time? Nice effort, Axel.

On Tue, 2005-05-31 at 08:15 +0100, Axel Simon wrote:
On Mon, 2005-05-30 at 19:18 +0100, Duncan Coutts wrote:
[..]
Going back to the lexer, it now produces exactly the same output as the original lexer (including positions and unique names). Sadly it seems to have got quite a bit slower, for reasons I don't quite understand. In particular, making it monadic (which we need to do) seems to make it rather slower. It is now taking 6 seconds rather than 2, and so is now only a little faster than the original lexer. Though on the positive side it means that if the lexer is taking 6 out of the 8 second total, then the parser is only taking 2 seconds, which is quite good.
Ok, I'm impressed, too. But was the parser the culprit? It did use a lot of space, but then most of the time in our current setup is spent in serialisation. So if I understand your intention, you are mainly trying to improve the memory footprint, not the compilation time?
Basically yes. The real problem was the memory use. The existing parser was taking 270Mb for the Gtk+ headers, while this new one now takes 29Mb.

I've tried integrating this parser into c2hs, and overall, producing the precomp file now runs in 80Mb of heap space. In fact a significant minority of that space is only required during the serialisation; the name analysis phase only pushes the memory requirements up to 50Mb or so. (I may be wrong about that; it may be that the serialisation is simply forcing the result of the name analysis, which thereby increases the heap use.)

The slowness of the serialisation is a separate problem. But reducing the memory requirements of the other phases makes even that part faster. On my fast athlon it used to take about a minute to generate the Gtk+ precomp file (and 380Mb). It now takes 13 seconds (and 80Mb). I guess the improvement in the time taken to do the serialisation is mostly from having to do less GC.

There's still some small difference in the precomp file which I have not yet tracked down (but in my earlier parser tests, the AST seems to be exactly the same, right down to the source locations and unique names).

So I think it's worth trying to get this done for the 0.9.8 gtk2hs release. That should provide reasonable testing, and then we can create patches for the mainline c2hs.

Duncan

Thanks for your effort, Duncan. I'll be able to compile gtk2hs on my platform in the future. :-) For the far future we should think about GObject introspection. Johan Dahlin wrote a Python binding of libpoppler (a PDF library) using GObject introspection. Alain.
participants (8)
- Andre Pang
- André Pang
- Axel Simon
- Duncan Coutts
- Gour
- Gour
- Manuel M T Chakravarty
- ROUGE Alain