Re: performance issues in simple arithmetic code

28 Apr 2011

Excerpts from Denys Rtveliashvili's message of Thu Apr 28 04:41:48 -0400 2011:
...
> Well.. I found some places in C-- compiler which are supposed to convert
> division and multiplication by 2^n into shifts. And I believe these work
> sometimes.
> However in this case I am a bit puzzled because even if I change the
> constants in my example to 2^n like 1024 the code is not optimised.

You are referring to the mini-optimizer in cmm/CmmOpt.hs, correct?
Specifically:

    cmmMachOpFold mop args@[x, y@(CmmLit (CmmInt n _))]
      = case mop of
            MO_Mul rep
               | Just p <- exactLog2 n ->
                     cmmMachOpFold (MO_Shl rep) [x, CmmLit (CmmInt p rep)]
            MO_U_Quot rep
               | Just p <- exactLog2 n ->
                     cmmMachOpFold (MO_U_Shr rep) [x, CmmLit (CmmInt p rep)]
            MO_S_Quot rep
               | Just p <- exactLog2 n,
                 CmmReg _ <- x ->       -- We duplicate x below, hence require

See the third case.  It is a rather delicate special case: the rewrite only
fires when the incoming argument is a register, and in many instances it is
not:

    sef_ret()
            { [const 0;, const 34;]
            }
        ceq:
            Hp = Hp + 8;
            if (Hp > I32[BaseReg + 92]) goto ceu;
            _seg::I32 = %MO_S_Quot_W32(I32[R1 + 3], 1024); <-- oops, it's a memory load
            I32[Hp - 4] = GHC.Types.I#_con_info;
            I32[Hp] = _seg::I32;
            R1 = Hp - 3;
            Sp = Sp + 4;
            jump I32[Sp] ();
        ceu:
            I32[BaseReg + 112] = 8;
            jump (I32[BaseReg - 8]) ();
    }

(This is optimized Cmm, which you can get with -ddump-opt-cmm).
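
For anyone who wants to reproduce a dump like this, a small module along the following lines should be enough. The module and function names are made up for illustration, and the labels, offsets and exact shape of the output will vary with GHC version, platform and optimisation level, so treat it as a sketch rather than the code from the original report.

```haskell
-- Quot.hs: hypothetical test module, not taken from the original thread.
-- Compile with optimisation and ask for the optimised Cmm, e.g.:
--   ghc -O -c Quot.hs -ddump-opt-cmm
module Quot where

-- Signed division by 2^10. This should reach the Cmm optimiser as a
-- signed-quotient MachOp applied to the unboxed Int.
divByPow2 :: Int -> Int
divByPow2 x = x `quot` 1024
{-# NOINLINE divByPow2 #-}

-- Multiplication by the same constant, for comparison.
mulByPow2 :: Int -> Int
mulByPow2 x = x * 1024
{-# NOINLINE mulByPow2 #-}
```

The NOINLINE pragmas are only there to keep the two functions visible as separate entries in the dump.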

Multiplication, on the other hand, manages to pull it off more frequently:

    sef_ret()
            { [const 0;, const 34;]
            }
        ceq:
            Hp = Hp + 8;
            if (Hp > I32[BaseReg + 92]) goto ceu;
            _seg::I32 = I32[R1 + 3] << 10;
            I32[Hp - 4] = GHC.Types.I#_con_info;
            I32[Hp] = _seg::I32;
            R1 = Hp - 3;
            Sp = Sp + 4;
            jump I32[Sp] ();
        ceu:
            I32[BaseReg + 112] = 8;
            jump (I32[BaseReg - 8]) ();
    }

This might be a poor interaction with the inliner. I haven't investigated fully though.
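
To make the difference between the two dumps concrete, here is a toy model of the two rewrites. Everything in it (the Expr type, log2Exact, fold, quotPow2) is hypothetical and much simpler than the real CmmOpt code. The bias-then-shift sequence is one standard way to get truncating signed division out of an arithmetic shift; I am assuming, not claiming, that GHC's actual rewrite has the same overall shape. The point is only that the signed case has to mention x twice, which is harmless for a register but would repeat a memory load.

```haskell
import Data.Bits (countTrailingZeros, popCount, shiftL, shiftR, (.&.))
import Data.Int (Int32)

-- Toy expression type; these constructors are illustrative, not GHC's.
data Expr
  = Reg String        -- a register: cheap to mention twice
  | Load Expr         -- a memory load, like I32[R1 + 3]
  | Lit Int32
  | Add Expr Expr
  | And Expr Expr
  | Mul Expr Expr
  | SQuot Expr Expr   -- signed division, truncating toward zero
  | Shl Expr Int
  | SShr Expr Int     -- arithmetic (sign-preserving) shift right
  deriving (Eq, Show)

-- Stand-in for exactLog2: Just p when n == 2^p, otherwise Nothing.
log2Exact :: Int32 -> Maybe Int
log2Exact n
  | n > 0 && popCount n == 1 = Just (countTrailingZeros n)
  | otherwise                = Nothing

-- Toy version of the fold: multiplication accepts any operand,
-- signed division only a register, because x is duplicated.
fold :: Expr -> Expr
fold (Mul x (Lit n))
  | Just p <- log2Exact n = Shl x p
fold (SQuot x@(Reg _) (Lit n))
  | Just p <- log2Exact n =
      let signMask = SShr x 31                      -- first use of x
          bias     = And signMask (Lit (2 ^ p - 1))
      in  SShr (Add x bias) p                       -- second use of x
fold e = e

-- The same arithmetic on plain 32-bit values, for 0 <= p <= 30:
-- add 2^p - 1 to negative inputs so the shift rounds toward zero.
quotPow2 :: Int32 -> Int -> Int32
quotPow2 x p = (x + bias) `shiftR` p
  where
    signMask = x `shiftR` 31                 -- all ones if x < 0, else 0
    bias     = signMask .&. ((1 `shiftL` p) - 1)
```

Loaded into GHCi, fold (Mul (Load (Reg "R1")) (Lit 1024)) becomes a Shl by 10, while fold (SQuot (Load (Reg "R1")) (Lit 1024)) comes back unchanged, which mirrors the two dumps above.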
...
> By the way, is there any kind of documentation on how to hack C-- compiler?
> In particular, I am interested in:
> * how to run its optimiser against some C-- code and see what does it do
> * knowing more about its internals

GHC supports compiling C-- code; just name your file with a .cmm extension
and GHC will parse it and, if you are using the native backend, do some minor
optimizations and register allocation.

As usual, the GHC Trac has some useful information:

    - http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType
    - http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/NCG

I also highly recommend reading cmm/OldCmm.hs and cmm/CmmExpr.hs, which explain
the internal AST we use for Cmm, as well as cmm/OldCmmPpr.hs and cmm/CmmParse.y (and
cmm/CmmLex.x) to understand textual C--.  Note that there is also a "new" C--
representation hanging around that is not too interesting for you, since we don't
use it at all without the flag -fnew-codegen.

Edward

Edward Z. Yang