Re: [Haskell-cafe] A Performance Puzzle

4 Aug 2018

      Hi Gregory,

Gregory Wright wrote:
...
Something Haskell has lacked for a long time is a good medium-duty
linear system solver based on the LU decomposition.  There are bindings
to the usual C/Fortran libraries, but not one in pure Haskell.  (An
example "LU factorization" routine that does not do partial pivoting has
been around for years, but lacking pivoting it can fail unexpectedly on
well-conditioned inputs.  Another Haskell LU decomposition using partial
pivoting is around, but it uses an inefficient representation of the
pivot matrix, so it's not suited to solving systems of more than 100 x
100, say.)
There are ways to improve the code; in particular, one can change the
matrix multiplication to accumulate the result of multiplying a row
by a column:

    numLoopState 0 (ca - 1) 0 $ \s k -> do
        aik <- MU.unsafeRead a (i, k)
        bkj <- MU.unsafeRead b (k, j)
        return $! s + aik * bkj

At the end of the day, ghc's native code generator is pretty terrible
for tight inner loops as they arise in matrix multiplication. The
above results in an assembly language loop that does lots of useless
shuffling of registers (saving and restoring them on the stack), index
recomputation, and even checking a pointer tag to see if some closure
has already been evaluated, all in the innermost loop. (Full code at
end.)

If you use -fllvm, the code becomes quite a bit faster, and the inner
loop looks amazingly decent; there is no saving of state anymore, no
tag check, and the index computation has mostly disappeared thanks to
strength reduction:

  4114a0:       f2 0f 10 0f             movsd  (%rdi),%xmm1
     ; fetch entry from first matrix
  4114a4:       f2 0f 59 0b             mulsd  (%rbx),%xmm1
     ; multiply by entry from second matrix
  4114a8:       f2 0f 58 c1             addsd  %xmm1,%xmm0
     ; add to accumulator
  4114ac:       49 ff c5                inc    %r13
     ; increment loop counter
  4114af:       4c 01 e3                add    %r12,%rbx
     ; advance pointer into second matrix
  4114b2:       48 83 c7 08             add    $0x8,%rdi
     ; advance pointer into first matrix
  4114b6:       4c 39 ee                cmp    %r13,%rsi
     ; test for end of loop
  4114b9:       75 e5                   jne    4114a0 

But this could be quite a bit faster if the code were vectorized, for
example, by processing 4 columns at the same time. I would expect a
good implementation of BLAS to do that.

To speed matrix multiplication up on a higher level, Strassen
multiplication comes to mind, though "at the price of a somewhat
reduced numerical stability" (Wikipedia).

Cheers,

Bertram

NCG inner loop:

_cdUi:
        cmpq %rbx,%r10          ; check loop upper bound
        je _cdUx                ; exit loop
_cdUr:
        movq $block_cdUp_info,-96(%rbp)
        movq %rbx,%r11          ; save %rbx in %r11
        movq %rcx,%rbx
        movq %rdx,-88(%rbp)
        movq %rsi,-80(%rbp)
        movq %rax,-72(%rbp)
        movq %rdi,-64(%rbp)
        movq %r8,-56(%rbp)
        movq %r9,-48(%rbp)
        movq %r11,-40(%rbp)
        movq %rcx,-32(%rbp)
        movq %r14,-24(%rbp)
        movq %r10,-16(%rbp)
        movsd %xmm0,-8(%rbp)    ; save lots of state
        addq $-96,%rbp
        testb $7,%bl            ; check whether (rbx) is evaluated
        jne _cdUp               ; yes -> skip closure
_cdUs:
        jmp *(%rbx)             ; enter closure
.align 8
        .quad   122571
        .quad   30
block_cdUp_info:
_cdUp:
        movq 8(%rbp),%rax       ; was: %rdx
        movq 16(%rbp),%rcx      ; was: %rsi
        movq 24(%rbp),%rdx      ; was: %rax
        movq 32(%rbp),%rsi      ; was: %rdi
        movq 40(%rbp),%rdi      ; was: %8
        movq 48(%rbp),%r8       ; was: %r9
        movq 56(%rbp),%r9       ; was: %r11
        movq 64(%rbp),%r10      ; was: %rcx
        movq 72(%rbp),%r11      ; was: %r14
        movq 80(%rbp),%r14      ; was: %r10
        movq %rax,64(%rsp)      ; spill, was: %rdx
        movq %r14,%rax
        movq %rcx,72(%rsp)      ; spill, was: %rsi
        movq 64(%rsp),%rcx
        imulq %rcx,%rax
        addq %r11,%rax
        movq %rsi,%rcx
        addq %rax,%rcx
        movq 72(%rsp),%rax
        addq %rcx,%rax
        movq 7(%rbx),%rbx
        movq 64(%rsp),%rcx
        imulq %rcx,%rbx
        addq %r14,%rbx
        movq %rdi,%rcx
        addq %rbx,%rcx
        movq 72(%rsp),%rbx
        addq %rcx,%rbx
        movsd 16(%rdx,%rbx,8),%xmm0     ; read one array entry
        mulsd 16(%rdx,%rax,8),%xmm0     ; multiply by other array entry
        movsd 88(%rbp),%xmm1            ; add accumulator (was: %xmm0)
        addsd %xmm0,%xmm1
        addq $96,%rbp
        incq %r14
_nefT:
        movsd %xmm1,%xmm0       ; update accumulator
        movq %r10,%rcx          ; was: %rcx
        movq %r14,%r10          ; was: %r10 (now + 1)
        movq %r11,%r14          ; was: %r9
        movq %r9,%rbx           ; was: %r11 (where we saved %rbx earlier)
        movq %r8,%r9            ; was: %r9
        movq %rdi,%r8           ; was: %r8
        movq %rsi,%rdi          ; was: %rdi
        movq %rdx,%rax          ; was: %rax
        movq 72(%rsp),%rsi      ; was: %rsi
        movq 64(%rsp),%rdi      ; was: %rdx ???
        jmp _cdUi               ; loop

Re: [Haskell-cafe] A Performance Puzzle

Bertram Felgenhauer