
Hi, I'm trying to avoid CAF's being created, and I was wondering how is the best way to do it with GHC. For example: f = .... Can I attach {-# INLINE #-} to f, and expect the CAF to be removed that way? Is there any NOCAF annotation. I can always f _, then change the callers to f () - if I'm doing that what is the most efficient fake argument, and should I annotate f in any way? If the dummy argument isn't used, is GHC going to float the let outside and re-CAF it? The next example is: foreign import ccall safe "stdio.h getchar" getchar2 :: CInt Can I stop this being CAF'd in any way? I'm working at a very low level here, so any hacky things with big disclaimers are still appreciated. Thanks Neil

| -----Original Message----- | From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-bounces@haskell.org] On | Behalf Of Neil Mitchell | Sent: 16 May 2007 10:16 | To: glasgow-haskell-users@haskell.org | Subject: Avoiding CAF's | | Hi, | | I'm trying to avoid CAF's being created, and I was wondering how is | the best way to do it with GHC. For example: | | f = .... | | Can I attach {-# INLINE #-} to f, and expect the CAF to be removed | that way? Is there any NOCAF annotation. I can always f _, then change | the callers to f () - if I'm doing that what is the most efficient | fake argument, and should I annotate f in any way? If the dummy | argument isn't used, is GHC going to float the let outside and re-CAF | it? See this thread: http://www.nabble.com/%7B---INLINE-me_harder---%7D-tf3599366.html | | The next example is: | | foreign import ccall safe "stdio.h getchar" getchar2 :: CInt | | Can I stop this being CAF'd in any way? What is the problem you are trying to solve? S

Hi
See this thread:
http://www.nabble.com/%7B---INLINE-me_harder---%7D-tf3599366.html
I'll try and see if I can do this with RULES.
| foreign import ccall safe "stdio.h getchar" getchar2 :: CInt | | Can I stop this being CAF'd in any way?
What is the problem you are trying to solve?
I have written an optimiser for Haskell in Yhc Core, and I'm now trying to spit out optimised Haskell so GHC can create a binary to get top performance. The main loop is given in: http://hpaste.org/1862 The FFI call must create a IO monad box to put the thing in, then the main loop strips the IO box away. If I could avoid the overhead of creating the box in the meantime, that would be preferable. With that loop I'm about 10% slower than C on the "wc -c" problem using "getchar" as the character supply, using String = [Char] - and about 25% faster than GHC. Thanks Neil

Hi
See this thread:
http://www.nabble.com/%7B---INLINE-me_harder---%7D-tf3599366.html
I'll try and see if I can do this with RULES.
I remember the reason that was unsatisfactory now. The RULES only fire at high optimisation levels, whereas for this particular program the CAF/unCAF-ness of a function effects whether the program gives the correct answer. I really want to annotate in a way that changes the semantics, for which RULES should be avoided. Thanks Neil

| I remember the reason that was unsatisfactory now. The RULES only fire | at high optimisation levels, whereas for this particular program the | CAF/unCAF-ness of a function effects whether the program gives the | correct answer. That is indeed scary. Would you like to give a small example of such a program? Simon

On Wed, May 16, 2007 at 02:22:22PM +0100, Neil Mitchell wrote:
The FFI call must create a IO monad box to put the thing in, then the main loop strips the IO box away. If I could avoid the overhead of creating the box in the meantime, that would be preferable.
Does the boxing not get optimised out? Is the FFI imported function exported from the module? Thanks Ian

Hi Ian and Simon,
Ian said: Does the boxing not get optimised out? Is the FFI imported function exported from the module?
http://hpaste.org/1882 (replicated at the end of this message in case the hpaste is not around forever, but clearly layout and syntax colouring) Thats the main branch, which is the bit I want to make go faster, if at all possible. The FFI call is not exported, I have module Main(main) at the top. From what I can see, the function is being called, then: case Main.$wccall GHC.Prim.realWorld# of wild_X28 { (# ds_d2ad, ds1_d2ac #) -> i.e. it has had an artificial box put around the answer. It may be impossible to eliminate this, but if it is, I'd like to try. The motivation for all this is: http://neilmitchell.blogspot.com/2007/05/13-faster-than-ghc.html
Simon said: That is indeed scary. Would you like to give a small example of such a program?
From the above example, you can note that the first argument to Main.$sprelude_942_ll107 is an Int (v2_aVr), which is entirely ignored on the recursive branch, and then on the terminating branch is case'd in a pointless way (this case comes from a seq). If this parameter could be removed, I suspect a speedup would result.
The reason this parameter is introduced comes from the code: overlay_get_char h = inlinePerformIO (getCharIO h) foreign import ccall unsafe "stdio.h getchar" getchar :: IO CInt {-# NOINLINE getCharIO #-} getCharIO h = do c <- getchar return $ if c == (-1) then h `seq` (-1) else fromIntegral c I have artifically threaded h through getCharIO, and deliberately added a pointless seq, to ensure that the definition inside is not floated up. If I remove the h `seq` then GHC removes the argument from overlay_get_char, which turns that into a CAF, which then breaks the required semantics. I realise all of this trickery is against the spirit of a pure functional language, and is making assumptions that are not required to remain true. Right now I just want the fastest possible benchmarks though. Thanks Neil

Hi
http://hpaste.org/1882 (replicated at the end of this message in case the hpaste is not around forever, but clearly layout and syntax colouring)
For completeness, the code in question is: Main.$wccall [NEVER Nothing] :: GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Int# #) [GlobalId] [Arity 1 NoCafRefs Str: DmdType S] Main.$wccall = {__ccall getchar GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Int# #)}_d2af Rec { Main.$sprelude_942_ll107 :: GHC.Base.Int -> GHC.Prim.Int# -> GHC.Base.Int [GlobalId] [Arity 2 NoCafRefs] Main.$sprelude_942_ll107 = \ (v2_aVr :: GHC.Base.Int) (sc_s2VC :: GHC.Prim.Int#) -> case Main.$wccall GHC.Prim.realWorld# of wild_X28 { (# ds_d2ad, ds1_d2ac #) -> case ds1_d2ac of wild1_X2b { __DEFAULT -> Main.$sprelude_942_ll107 v2_aVr (GHC.Prim.+# sc_s2VC 1); (-1) -> case v2_aVr of tpl_X3g { GHC.Base.I# a_s2QS -> GHC.Base.I# sc_s2VC } } } end Rec }

Neil Mitchell wrote:
Hi Ian and Simon,
Ian said: Does the boxing not get optimised out? Is the FFI imported function exported from the module?
http://hpaste.org/1882 (replicated at the end of this message in case the hpaste is not around forever, but clearly layout and syntax colouring)
Thats the main branch, which is the bit I want to make go faster, if at all possible. The FFI call is not exported, I have module Main(main) at the top. From what I can see, the function is being called, then:
case Main.$wccall GHC.Prim.realWorld# of wild_X28 { (# ds_d2ad, ds1_d2ac #) ->
i.e. it has had an artificial box put around the answer. It may be impossible to eliminate this, but if it is, I'd like to try.
There's no actual box here. The ccall returns an unboxed tuple with two components, one of which has void type (State# RealWorld), so it has no runtime representation. The other component will be stored/returned in a register. Make sure your ccalls are annotated with unsafe, if you're not already.
I realise all of this trickery is against the spirit of a pure functional language, and is making assumptions that are not required to remain true. Right now I just want the fastest possible benchmarks though.
Ok, but suppose you get some good results - what's the point? None of this is guaranteed to work tomorrow. Why not aim for a robust translation into Core that preserves the semantics? It can't be that hard to pass around a State# RealWorld, can it? Cheers, Simon

Hi
case Main.$wccall GHC.Prim.realWorld# of wild_X28 { (# ds_d2ad, ds1_d2ac #) ->
i.e. it has had an artificial box put around the answer. It may be impossible to eliminate this, but if it is, I'd like to try.
There's no actual box here. The ccall returns an unboxed tuple with two components, one of which has void type (State# RealWorld), so it has no runtime representation. The other component will be stored/returned in a register. Make sure your ccalls are annotated with unsafe, if you're not already.
That is good news. They are already unsafe.
I realise all of this trickery is against the spirit of a pure functional language, and is making assumptions that are not required to remain true. Right now I just want the fastest possible benchmarks though.
Ok, but suppose you get some good results - what's the point? None of this is guaranteed to work tomorrow. Why not aim for a robust translation into Core that preserves the semantics? It can't be that hard to pass around a State# RealWorld, can it?
Yhc introduces all the necessary machinery to thread state through properly - so it starts off semantics correct. I then apply semantics preserving transformations, which means that the code still includes a threaded state - its just really hard to see where it went. Since I seem to have ended up passing an extra Int around as a dummy state, to stop GHC from CAF'ing stuff, it does seem to make more sense to use the GHC state from the beginning. Thanks Neil

Hi Simon,
Ok, but suppose you get some good results - what's the point? None of this is guaranteed to work tomorrow. Why not aim for a robust translation into Core that preserves the semantics? It can't be that hard to pass around a State# RealWorld, can it?
Is there a paper/document that gives the current design for the IO Monad as implemented in GHC, particularly with what things have zero runtime cost? I have a reasonable idea, but would like to make sure I know exactly whats going on before getting down to a very low level with this. Thanks Neil

Neil Mitchell wrote:
Hi Simon,
Ok, but suppose you get some good results - what's the point? None of this is guaranteed to work tomorrow. Why not aim for a robust translation into Core that preserves the semantics? It can't be that hard to pass around a State# RealWorld, can it?
Is there a paper/document that gives the current design for the IO Monad as implemented in GHC, particularly with what things have zero runtime cost? I have a reasonable idea, but would like to make sure I know exactly whats going on before getting down to a very low level with this.
Not really, this is something we should document in the commentary. The basic idea is that a value of type "State# s" for any s takes up zero registers or stack slots. However, a value of type "State# s -> a" is certainly different from a value of type "a", because for example seq can tell the difference. So an argument of type "State# s" behaves like any other argument: it can be partially applied, etc., the difference is that the actual value takes up no space. Cheers, Simon
participants (4)
-
Ian Lynagh
-
Neil Mitchell
-
Simon Marlow
-
Simon Peyton-Jones