How to get GHC to produce ADC instructions for long addition

Here is some code that adds two 192 bit numbers, represented as three 64bit machine words (well, on my machine anyway), and returns the result and any carry: {-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} import GHC.Prim( plusWord2#, Word#, or#) longAdd :: (# Word#, Word#, Word# #) -> (# Word#, Word#, Word# #) -> (# Word#, (# Word#, Word#, Word# #) #) longAdd (# xl, xm, xh #) (# yl, ym, yh #) = let plusWord2WithCarry x y c = let (# c1, r1 #) = plusWord2# x y (# c2, r2 #) = plusWord2# r1 c in (# plusWord# c1 c2, r2 #) (# cl, rl #) = plusWord2# xl yl (# cm, rm #) = plusWord2WithCarry xm ym cl (# ch, rh #) = plusWord2WithCarry xh yh cm in (# ch, (# rl, rm, rh #) #) (My code covers words other than size 3 btw) I'd like this to compile into something like: add x1 y1 adc x2 y2 adc x3 y3 Unfortunately, I think my problem is with the "plusWord2WithCarry". As there's no primitive operation in Haskell which adds two words and a carry. What seems to happen when I look at the generated assembly is the following: xor some_reg_1 some_reg_1 add x1 y1 adc $0 some_reg_1 xor some_reg_2 some_reg_2 adc x2 y2 adc $0 some_reg_2 xor some_reg_3 some_reg_3 adc some_reg_1 y2 adc $0 some_reg_3 add some_reg_2 some_reg_3 ... and so on Basically, instead of just using the carry flag in add of the next two higher words, it instead saves it to a register, adds the next higher order words without the carry, clears a register and then adds the carry to this register, adds the first mentioned carry, against clears a register and saves that carry, combines the two resulting carries and passes it to the next higher order addition. I know that sounds complex, and the code is a bit of a mess. Bizarrely, when I send it through the LLVM backend it ends up worse, generating a bunch of 32 bit shifts for reasons I can't understand. I was hoping the LLVM backend would be able to produce the "adc" instructions. Is there anything I could do to coax it into it. I know writing it in C (or inline assembly in C) is an option, but after you add the code to call and return such a small amount of work it seems hardly worth it. I'd like to keep it in GHC so it can be inlined where appropriate. Any ideas on how to entice GHC (either natively or via LLVM) to produce better code in this case?
participants (1)
-
Clinton Mead