
Hi Gregory, Gregory Wright wrote:
Something Haskell has lacked for a long time is a good medium-duty linear system solver based on the LU decomposition. There are bindings to the usual C/Fortran libraries, but not one in pure Haskell. (An example "LU factorization" routine that does not do partial pivoting has been around for years, but lacking pivoting it can fail unexpectedly on well-conditioned inputs. Another Haskell LU decomposition using partial pivoting is around, but it uses an inefficient representation of the pivot matrix, so it's not suited to solving systems of more than 100 x 100, say.)
There are ways to improve the code; in particular, one can change the
matrix multiplication to accumulate the result of multiplying a row
by a column:
numLoopState 0 (ca - 1) 0 $ \s k -> do
aik <- MU.unsafeRead a (i, k)
bkj <- MU.unsafeRead b (k, j)
return $! s + aik * bkj
At the end of the day, ghc's native code generator is pretty terrible
for tight inner loops as they arise in matrix multiplication. The
above results in an assembly language loop that does lots of useless
shuffling of registers (saving and restoring them on the stack), index
recomputation, and even checking a pointer tag to see if some closure
has already been evaluated, all in the innermost loop. (Full code at
end.)
If you use -fllvm, the code becomes quite a bit faster, and the inner
loop looks amazingly decent; there is no saving of state anymore, no
tag check, and the index computation has mostly disappeared thanks to
strength reduction:
4114a0: f2 0f 10 0f movsd (%rdi),%xmm1
; fetch entry from first matrix
4114a4: f2 0f 59 0b mulsd (%rbx),%xmm1
; multiply by entry from second matrix
4114a8: f2 0f 58 c1 addsd %xmm1,%xmm0
; add to accumulator
4114ac: 49 ff c5 inc %r13
; increment loop counter
4114af: 4c 01 e3 add %r12,%rbx
; advance pointer into second matrix
4114b2: 48 83 c7 08 add $0x8,%rdi
; advance pointer into first matrix
4114b6: 4c 39 ee cmp %r13,%rsi
; test for end of loop
4114b9: 75 e5 jne 4114a0