
Hi, While trying to figure out why some of my code is very slow I have found that it is something related to division. Digging a bit deeper I found an example which shows some unexpected magic and a lack of the expected one. Before raising any tickets in trac I would like to consult with you regarding what I see. Maybe I am misunderstanding the way GHC is supposed to work. ------------------- module Test where import Data.Int import GHC.Exts import GHC.Prim foo :: Int -> Int foo a = let b = a `quot` 1111 c = b `quot` 1113 d = c `quot` 1117 in d bar :: Int -> Int bar !a' = let !(I# a) = a' !(b) = quotInt# a 1111# !(c) = quotInt# b 1113# !(d) = quotInt# c 1117# in I# d ------------------- Here 'foo' is a function written in a common way and 'bar' is essentially identical one, written in a low-level style. * My understanding is that these functions are equivalent in terms of what they do. The only difference is in the code being generated. Unexpected magic is in the Core dump: ------------------- Test.$wfoo = \ (ww_sxw :: GHC.Prim.Int#) -> case ww_sxw of wild1_ax0 { __DEFAULT -> case GHC.Prim.quotInt# wild1_ax0 1111 of wild2_Xxc { __DEFAULT -> case GHC.Prim.quotInt# wild2_Xxc 1113 of wild3_Xxt { __DEFAULT -> GHC.Prim.quotInt# wild3_Xxt 1117; (-9223372036854775808) -> (-8257271295304186) }; (-9223372036854775808) -> (-7418931981405) }; (-9223372036854775808) -> (-6677706553) } Test.bar = \ (a'_ah5 :: GHC.Types.Int) -> case a'_ah5 of _ { GHC.Types.I# ipv_ste -> GHC.Types.I# (GHC.Prim.quotInt# (GHC.Prim.quotInt# (GHC.Prim.quotInt# ipv_ste 1111) 1113) 1117) } ------------------- Question 1: what is the meaning of those magic numbers -9223372036854775808, -6677706553, -7418931981405, -8257271295304186? Question 2: under which circumstances those strange branches of execution will be used and what those results would mean? Question 3: why is the Core for 'foo' so different to 'bar'? The lack of expected magic is in the assembler code: ------------------- addq $16,%r12 cmpq 144(%r13),%r12 ja .Lcz1 movl $1117,%ecx movl $1113,%r10d movl $1111,%r11d movq 7(%rbx),%rax cqto idivq %r11 cqto idivq %r10 cqto idivq %rcx movq $ghczmprim_GHCziTypes_Izh_con_info,-8(%r12) movq %rax,0(%r12) leaq -7(%r12),%rbx addq $8,%rbp jmp *0(%rbp) ------------------- Question: can't it use cheap multiplication and shift instead of expensive division here? I know that such optimisation is implemented at least to some extent for C--. I suppose it also won't do anything smart for expressions like a*4 or a/4 for the same reason. With kind regards, Denys Rtveliashvili

On April 27, 2011 23:01:50 Denys Rtveliashvili wrote:
Question 1: what is the meaning of those magic numbers -9223372036854775808, -6677706553, -7418931981405, -8257271295304186? Question 2: under which circumstances those strange branches of execution will be used and what those results would mean? Question 3: why is the Core for 'foo' so different to 'bar'?
The largest representable 64bit negative number in twos complement is -9223372036854775808. It doesn't have a positive counter part as 0 eats one out of the range of positive numbers. The code for "quot x y" therefore checks for the special case "x==minBound" and "y==-1" in order to be able to thrown an overFlowError. I suspect what you are seeing is a specialization for this branch. That is, the start of the overflow case check introduced a branch for x==-9223372036854775808. The compiler then notices that it know both x and y here, so it just puts in the answer, resulting in a permanent special case. Your foo and bar code is so different as direct use of the quotInt# primitive avoids this check and hence the resulting specialization for it. Cheers! -Tyson

On 27 April 2011 20:01, Denys Rtveliashvili
The lack of expected magic is in the assembler code: -------------------
addq $16,%r12 cmpq 144(%r13),%r12 ja .Lcz1 movl $1117,%ecx movl $1113,%r10d movl $1111,%r11d movq 7(%rbx),%rax cqto idivq %r11 cqto idivq %r10 cqto idivq %rcx movq $ghczmprim_GHCziTypes_Izh_con_info,-8(%r12) movq %rax,0(%r12) leaq -7(%r12),%rbx addq $8,%rbp jmp *0(%rbp)
------------------- Question: can't it use cheap multiplication and shift instead of expensive division here? I know that such optimisation is implemented at least to some extent for C--. I suppose it also won't do anything smart for expressions like a*4 or a/4 for the same reason.
There isn't really any optimisation done on Cmm and the native code generator doesn't do much optimisation itself, hence you get the more direct forward translation. This kind of code is where the LLVM backend does well in comparison. I haven't tried benchmarking the performance of -fasm vs -fllvm for this code but if you eyeball the assembly code produced by -fllvm then you'll see it uses shifts and other magic. Cheers, David

Hi David,
------------------- Question: can't it use cheap multiplication and shift instead of expensive division here? I know that such optimisation is implemented at least to some extent for C--. I suppose it also won't do anything smart for expressions like a*4 or a/4 for the same reason. There isn't really any optimisation done on Cmm and the native code generator doesn't do much optimisation itself, hence you get the more direct forward translation. This kind of code is where the LLVM backend does well in comparison. I haven't tried benchmarking the performance of -fasm vs -fllvm for this code but if you eyeball the assembly code produced by -fllvm then you'll see it uses shifts and other magic.
Cheers, David
Well.. I found some places in C-- compiler which are supposed to convert division and multiplication by 2^n into shifts. And I believe these work sometimes. However in this case I am a bit puzzled because even if I change the constants in my example to 2^n like 1024 the code is not optimised. By the way, is there any kind of documentation on how to hack C-- compiler? In particular, I am interested in: * how to run its optimiser against some C-- code and see what does it do * knowing more about its internals With kind regards, Denys Rtveliashvili

Excerpts from Denys Rtveliashvili's message of Thu Apr 28 04:41:48 -0400 2011:
Well.. I found some places in C-- compiler which are supposed to convert division and multiplication by 2^n into shifts. And I believe these work sometimes.
However in this case I am a bit puzzled because even if I change the constants in my example to 2^n like 1024 the code is not optimised.
You are referring to the mini-optimizer in cmm/CmmOpt.hs, correct? Specifically: cmmMachOpFold mop args@[x, y@(CmmLit (CmmInt n _))] = case mop of MO_Mul rep | Just p <- exactLog2 n -> cmmMachOpFold (MO_Shl rep) [x, CmmLit (CmmInt p rep)] MO_U_Quot rep | Just p <- exactLog2 n -> cmmMachOpFold (MO_U_Shr rep) [x, CmmLit (CmmInt p rep)] MO_S_Quot rep | Just p <- exactLog2 n, CmmReg _ <- x -> -- We duplicate x below, hence require See the third case. This appears to be something of a delicate special case, in particular, the incoming argument is required to be a register, which is not the case in many instances: sef_ret() { [const 0;, const 34;] } ceq: Hp = Hp + 8; if (Hp > I32[BaseReg + 92]) goto ceu; _seg::I32 = %MO_S_Quot_W32(I32[R1 + 3], 1024); <-- oops, it's a memory load I32[Hp - 4] = GHC.Types.I#_con_info; I32[Hp] = _seg::I32; R1 = Hp - 3; Sp = Sp + 4; jump I32[Sp] (); ceu: I32[BaseReg + 112] = 8; jump (I32[BaseReg - 8]) (); } (This is optimized Cmm, which you can get with -ddump-opt-cmm). Multiplication, on the other hand, manages to pull it off more frequently: sef_ret() { [const 0;, const 34;] } ceq: Hp = Hp + 8; if (Hp > I32[BaseReg + 92]) goto ceu; _seg::I32 = I32[R1 + 3] << 10; I32[Hp - 4] = GHC.Types.I#_con_info; I32[Hp] = _seg::I32; R1 = Hp - 3; Sp = Sp + 4; jump I32[Sp] (); ceu: I32[BaseReg + 112] = 8; jump (I32[BaseReg - 8]) (); } This might be a poor interaction with the inliner. I haven't investigated fully though.
By the way, is there any kind of documentation on how to hack C-- compiler? In particular, I am interested in: * how to run its optimiser against some C-- code and see what does it do * knowing more about its internals
GHC supports compiling C-- code; just name your file with a .cmm extension and GHC will parse it and, if it's the native backend, do some minor optimizations and register allocation. As usual, the GHC Trac has some useful information: - http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType - http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/NCG I also highly recommend reading cmm/OldCmm.hs and cmm/CmmExpr.hs, which explain the internal AST we use for Cmm, as well as cmm/OldCmmPpr.hs and cmm/CmmParse.y (and cmm/CmmLex.x) to understand textual C--. Note that there is also a "new" C-- representation hanging around that is not too interesting for you, since we don't use it at all without the flag -fnew-codegen. Edward

| cmm/CmmLex.x) to understand textual C--. Note that there is also a "new" C-- | representation hanging around that is not too interesting for you, since we don't | use it at all without the flag -fnew-codegen. Although ultimately we hope to move to the new rep and abandon the old one. Simon
participants (5)
-
David Terei
-
Denys Rtveliashvili
-
Edward Z. Yang
-
Simon Peyton-Jones
-
Tyson Whitehead