How to get GHC to produce ADC instructions for long addition

9 Nov 2015

      Here is some code that adds two 192 bit numbers, represented as three 64bit
machine words (well, on my machine anyway), and returns the result and any
carry:

{-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} import GHC.Prim(
plusWord2#, Word#, or#) longAdd :: (# Word#, Word#, Word# #) -> (# Word#,
Word#, Word# #) -> (# Word#, (# Word#, Word#, Word# #) #) longAdd (# xl, xm,
xh #) (# yl, ym, yh #) = let plusWord2WithCarry x y c = let (# c1, r1 #) =
plusWord2# x y (# c2, r2 #) = plusWord2# r1 c in (# plusWord# c1 c2, r2 #)
(# cl, rl #) = plusWord2# xl yl (# cm, rm #) = plusWord2WithCarry xm ym cl
(# ch, rh #) = plusWord2WithCarry xh yh cm
in (# ch, (# rl, rm, rh #) #)

(My code covers words other than size 3 btw)

I'd like this to compile into something like:

add x1 y1
adc x2 y2
adc x3 y3

Unfortunately, I think my problem is with the "plusWord2WithCarry". As
there's no primitive operation in Haskell which adds two words and a carry.
What seems to happen when I look at the generated assembly is the following:

xor some_reg_1 some_reg_1
add x1 y1
adc $0 some_reg_1
xor some_reg_2 some_reg_2
adc x2 y2
adc $0 some_reg_2
xor some_reg_3 some_reg_3
adc some_reg_1 y2
adc $0 some_reg_3
add some_reg_2 some_reg_3
... and so on

Basically, instead of just using the carry flag in add of the next two
higher words, it instead saves it to a register, adds the next higher order
words without the carry, clears a register and then adds the carry to this
register, adds the first mentioned carry, against clears a register and
saves that carry, combines the two resulting carries and passes it to the
next higher order addition. I know that sounds complex, and the code is a
bit of a mess.

Bizarrely, when I send it through the LLVM backend it ends up worse,
generating a bunch of 32 bit shifts for reasons I can't understand.

I was hoping the LLVM backend would be able to produce the "adc"
instructions. Is there anything I could do to coax it into it. I know
writing it in C (or inline assembly in C) is an option, but after you add
the code to call and return such a small amount of work it seems hardly
worth it. I'd like to keep it in GHC so it can be inlined where
appropriate.

Any ideas on how to entice GHC (either natively or via LLVM) to produce
better code in this case?

Clinton Mead

tags

participants (1)