Floyd Warshall performance (again)

Dear haskell-cafe, I implemented the Floyd Warshall algorithm for finding the shortest path in a dense graph in Haskell, but noted the performance was extremely poor compared to C. Even using mutable unboxed arrays it was running about 30 times slower. I rewrote the program several times, to arrive at the following definition: module Main where import Data.Array.Base import Data.Array.ST import Data.Array.Unboxed import Control.Monad.ST import Control.Monad import System.IO import GHC.Exts #define SIZE 1500 -- Actual graph is immaterial for this benchmark. graph = [] shortestPath :: [(Int, Int, Int)] -> UArray Int Int shortestPath g = runSTUArray $ do let mnew = newArray (0, SIZE * SIZE) 1 mread arr i j = unsafeRead arr (i * SIZE + j) mwrite arr i j x = unsafeWrite arr (i * SIZE + j) x unsafeIOToST $ hSetBuffering stdout LineBuffering unsafeIOToST $ putStrLn "Allocating ..." pm <- mnew unsafeIOToST $ putStrLn "Allocating ... done" let loop1 SIZE = return () loop1 k = let loop2 SIZE = return () loop2 i = let loop3 SIZE = return () loop3 j = do xij <- mread pm i j xik <- mread pm i k xkj <- mread pm k j mwrite pm i j (min xij (xik + xkj)) loop3 (j + 1) in loop3 0 >> loop2 (i + 1) in loop2 0 >> loop1 (k + 1) loop1 0 return pm main = shortestPath graph `seq` return () These 3 nested loops run a lot faster than generating a list of IO actions and then running them using sequence_. But still this program runs 3 times slower than it's C counterpart: $ time ./FloydWarshall Allocating ... Allocating ... done ./FloydWarshall 45,75s user 0,09s system 97% cpu 46,815 total $ time ./a.out ./a.out 14,92s user 0,03s system 98% cpu 15,230 total The source code for the C program is given below. Using the vector package instead of GHC's inbuilt ST arrays doesn't improve the situation any, neither does compiling via C. Any idea why the rather unidiomatic Haskell program is still 3 times slower, despite using unchecked indexing and the fact that the three nested tail recursive functions above shoud behave exactly as the 3 nested for loops in the C program? Best regards, Mathieu

Hello Mathieu, Friday, April 16, 2010, 12:06:06 PM, you wrote:
actions and then running them using sequence_. But still this program runs 3 times slower than it's C counterpart:
ghc low-level code optimization cannot be compared with best modern C compilers that's result of 20 years of optimization. ghc generates machine code in rather simple idiomatic way, so it should be compared to non-optimizing C compiler another haskell compiler, jhc, generates idiomatic C code, that therefore can be compiled by gcc to efficient machine code. but overall jhc is pretty experimental compiler ATM -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi Bulat,
ghc low-level code optimization cannot be compared with best modern C compilers that's result of 20 years of optimization. ghc generates machine code in rather simple idiomatic way, so it should be compared to non-optimizing C compiler
Sure. But I was curious if to see whether there was some optimization
I had missed, seeing as other similarly low level programs, such as
the nsieve benchmark of the language shootout, or the word counting
program, manage to run within a few percentage points of C if not
faster. Since this program doesn't use any features specific to
functional programming, such as higher order functions, and mostly
just calls out to imperative primitives of GHC not implemented in
Haskell (such as unsafeRead and unsafeWrite), I would have thought
that the gap in runtimes might have been smaller. Fortunately for me,
using the FFI is really quite easy so for the actual problem I am
trying to solve I just call out to a C implementation of the Floyd
Warshall algorithm.
I forgot to paste the source code for the C program. FWIW, here it is!
#include

Hello Mathieu, Friday, April 16, 2010, 12:42:29 PM, you wrote:
Sure. But I was curious if to see whether there was some optimization I had missed, seeing as other similarly low level programs, such as the nsieve benchmark of the language shootout, or the word counting program, manage to run within a few percentage points of C if not faster.
you know it's the big game with $00000's hanging around. sometimes we rewrite programs, sometimes libs and sometimes compiler itself. the best way to optimize your program is to add it to shootout itself :))) i know one case when required function was added to the library and i know that ghc was added better code generation for short loops. probably it was enough to shootout programs but not your one. but of course if making full-fledged optimizing compiler was so easy, it was made to C and Haskell many years ago
Since this program doesn't use any features specific to functional programming, such as higher order functions, and mostly just calls out to imperative primitives of GHC not implemented in Haskell (such as unsafeRead and unsafeWrite), I would have thought that the gap in runtimes might have been smaller.
it's not runtimes, but code generation. ghc -O2 should be rather close to gcc -O0. you just undervalue amount of work done in gcc in those 20 years :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

For what is is worth:
$ ghc -cpp -O2 -ddump-asm Main.hs > Main.s
$ time ./a.out
Allocating ...
Allocating ... done
real 0m39.487s
user 0m39.258s
sys 0m0.150s
$ ~/Programming/Checkouts/ghc.llvm/inplace/bin/ghc-stage2 -cpp -fllvm
-O2 Main.hs
$ time ./a.out
Allocating ...
Allocating ... done
real 0m20.443s
user 0m20.281s
sys 0m0.101s
So you have an order of magnitude improvement with the LLVM backend.
It looks to me like your Core code is near-optimal, so the performance
problem is all down to the backend to fix up.
However:
$ gcc -std=c99 -O2 Main.c
$ time ./a.out
real 0m9.120s
user 0m9.030s
sys 0m0.035s
We still have a long way to go!
Cheers,
Max
On 16 April 2010 09:06, Mathieu Boespflug
Dear haskell-cafe,
I implemented the Floyd Warshall algorithm for finding the shortest path in a dense graph in Haskell, but noted the performance was extremely poor compared to C. Even using mutable unboxed arrays it was running about 30 times slower. I rewrote the program several times, to arrive at the following definition:
module Main where
import Data.Array.Base import Data.Array.ST import Data.Array.Unboxed import Control.Monad.ST import Control.Monad import System.IO import GHC.Exts
#define SIZE 1500
-- Actual graph is immaterial for this benchmark. graph = []
shortestPath :: [(Int, Int, Int)] -> UArray Int Int shortestPath g = runSTUArray $ do let mnew = newArray (0, SIZE * SIZE) 1 mread arr i j = unsafeRead arr (i * SIZE + j) mwrite arr i j x = unsafeWrite arr (i * SIZE + j) x unsafeIOToST $ hSetBuffering stdout LineBuffering unsafeIOToST $ putStrLn "Allocating ..." pm <- mnew unsafeIOToST $ putStrLn "Allocating ... done" let loop1 SIZE = return () loop1 k = let loop2 SIZE = return () loop2 i = let loop3 SIZE = return () loop3 j = do xij <- mread pm i j xik <- mread pm i k xkj <- mread pm k j mwrite pm i j (min xij (xik + xkj)) loop3 (j + 1) in loop3 0 >> loop2 (i + 1) in loop2 0 >> loop1 (k + 1) loop1 0 return pm
main = shortestPath graph `seq` return ()
These 3 nested loops run a lot faster than generating a list of IO actions and then running them using sequence_. But still this program runs 3 times slower than it's C counterpart:
$ time ./FloydWarshall Allocating ... Allocating ... done ./FloydWarshall 45,75s user 0,09s system 97% cpu 46,815 total
$ time ./a.out ./a.out 14,92s user 0,03s system 98% cpu 15,230 total
The source code for the C program is given below. Using the vector package instead of GHC's inbuilt ST arrays doesn't improve the situation any, neither does compiling via C. Any idea why the rather unidiomatic Haskell program is still 3 times slower, despite using unchecked indexing and the fact that the three nested tail recursive functions above shoud behave exactly as the 3 nested for loops in the C program?
Best regards,
Mathieu _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 16/04/2010, at 18:06, Mathieu Boespflug wrote:
shortestPath :: [(Int, Int, Int)] -> UArray Int Int shortestPath g = runSTUArray $ do let mnew = newArray (0, SIZE * SIZE) 1 mread arr i j = unsafeRead arr (i * SIZE + j) mwrite arr i j x = unsafeWrite arr (i * SIZE + j) x unsafeIOToST $ hSetBuffering stdout LineBuffering unsafeIOToST $ putStrLn "Allocating ..." pm <- mnew unsafeIOToST $ putStrLn "Allocating ... done" let loop1 SIZE = return () loop1 k = let loop2 SIZE = return () loop2 i = let loop3 SIZE = return () loop3 j = do xij <- mread pm i j xik <- mread pm i k xkj <- mread pm k j mwrite pm i j (min xij (xik + xkj)) loop3 (j + 1) in loop3 0 >> loop2 (i + 1) in loop2 0 >> loop1 (k + 1) loop1 0 return pm
In general, GHC doesn't like nested loops. You might want to try the following structure: loop1 SIZE = return () loop1 k = loop2 k 0 loop2 k SIZE = loop1 (k+1) loop2 k i = loop3 k i 0 loop3 k i SIZE = loop2 k (i+1) loop3 k i j = do ... loop3 k i (j+1) And, as Max suggested, the llvm backend ought to improve things. Roman
participants (4)
-
Bulat Ziganshin
-
Mathieu Boespflug
-
Max Bolingbroke
-
Roman Leshchinskiy