Why is Haskell so slow (compared to Java/Scala)?

I've written a simple Haskell benchmark program which populates a primitive vector from the vector package:

```haskell
import Control.Monad (void)
import Criterion.Main
import Data.Traversable (for)
import Data.Vector.Primitive.Mutable as P

vectorBench :: Benchmark
vectorBench = bgroup "vector" [ primitive ]
  where
    primitive = bgroup "primitive" [ write1M ]
      where
        write1M = bench "write1M" $ nfIO $ do
          v <- P.unsafeNew 1000000
          void $ for [0 .. 1000000 - 1] $ flip (P.unsafeWrite v) (1 :: Int)
          return ()
```

I use `unsafeNew` to skip memory initialization and `unsafeWrite` to skip bounds checks; I guess it's the fastest possible way to write something to a vector. My result was about 40 ms.

I wrote a similar program in Scala:

```scala
for (_ <- 1 to 5) {
  val start = System.currentTimeMillis()
  val a = new Array[Long](1000000)
  for (i <- 0 until 1000000) {
    a(i) = 1L
  }
  val end = System.currentTimeMillis()
  println(s"${end - start} ms")
}
```

I skip neither bounds checks nor memory initialization, and I also used a generic array here (suitable for any type of object, not just for primitive types), so I expected a longer run time. But what I got was shocking:

```
7 ms
3 ms
2 ms
1 ms
2 ms
```

This program runs 20-40 times faster than the Haskell one after warm-up. 20-40 times, Carl! Why is Haskell soo slooooow? How can it be?

--
Sincerely, Stanislav Chernichkin.

Hi Stanislav,
Станислав Черничкин writes:
I've written a simple Haskell benchmark program which populates a primitive vector from the vector package:

```haskell
... void $ for [0..1000000 - 1] $ flip (P.unsafeWrite v) (1 :: Int)
```
`for` is a variant of `map`: it returns a list of results (in this case, a list of a million () values). Instead you should use `forM_` (note the underscore) from Control.Monad, which discards the results - that cuts the runtime by a huge amount when I try it. (And make sure to compile with -O too.)
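For concreteness, here is a minimal sketch of the corrected benchmark as a standalone program (assuming criterion's `defaultMain`, `bench`, and `nfIO`, as in your original snippet):

```haskell
import Control.Monad (forM_)
import qualified Data.Vector.Primitive.Mutable as P
import Criterion.Main (defaultMain, bench, nfIO)

-- Each unsafeWrite returns (), so forM_ discards the results
-- instead of accumulating a million-element list of units.
write1M :: IO ()
write1M = do
  v <- P.unsafeNew 1000000
  forM_ [0 .. 1000000 - 1] $ \i -> P.unsafeWrite v i (1 :: Int)

main :: IO ()
main = defaultMain [ bench "write1M" (nfIO write1M) ]
```

Nick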

On 20.09.2017 at 16:27, Станислав Черничкин wrote:
I've written a simple Haskell benchmark program, ...

That's a microbenchmark, which means that the result is not really meaningful.
Large programs have their inefficiencies in needless recomputation, which happens as soon as no single programmer knows everything anymore and people start reinventing the wheel. Small programs do not need to be particularly efficient. Well, usually - sometimes they do, but these cases are very rare.

In your particular case, the root cause was that your mental model of what Haskell was doing did not match what was really happening. This misdirected you to the wrong function and gave you surprising results. This is not your fault: Haskell's execution model is fundamentally different from that of most other languages, and it takes time and experience to explore all the consequences. It is not Haskell's fault either - Haskell was built to explore specifically this execution model. My impression is that it makes Haskell's performance somewhat less predictable, but allows you to build larger systems that "just work" - millions of lines of code, instead of the few thousand that you get with typical imperative code. Of course, these things become noticeable only if you (a) have worked on software that is substantially larger than a few thousand lines of code, and (b) at least one of these projects was in Haskell, or in another language with good support for referentially transparent functions.

I wouldn't concentrate on getting that benchmark fast. It's too early to get any meaningful data out of it. Essentially, while you can speed Haskell up that way, you are moving towards a more imperative style, and you'll lose the real advantages of Haskell. For this reason, I wouldn't pursue those monad library functions - not right now, anyway; they are pretty specialized, and you should have a better reason than performance to come back to them.

If you wish to explore performance, try to make your code faster without much library function use. You will get more surprising results, but you will also get a better understanding of what happens.

HTH
Jo

To recap:
You claimed a 40 ms measurement. If we use `for_` instead of `void $ for` (for the reasons mentioned earlier in the thread) and compile with `-O2`, we get just under 1 ms:
```
time 876.2 μs (852.8 μs .. 896.9 μs)
0.994 R² (0.989 R² .. 0.998 R²)
mean 838.4 μs (825.0 μs .. 855.5 μs)
std dev 48.21 μs (36.91 μs .. 74.18 μs)
variance introduced by outliers: 48% (moderately inflated)
```
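For reference, the `for_` variant differs from the earlier `forM_` sketch only in using the `Foldable` combinator; a minimal sketch:

```haskell
import Data.Foldable (for_)
import qualified Data.Vector.Primitive.Mutable as P

-- for_ runs the write for every index and throws the
-- () results away, just like forM_.
write1M :: IO ()
write1M = do
  v <- P.unsafeNew 1000000
  for_ [0 .. 1000000 - 1] $ \i -> P.unsafeWrite v i (1 :: Int)
```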
Cheers,
Thomas
On Wed, Sep 20, 2017 at 7:27 AM, Станислав Черничкин wrote:
I've written a simple Haskell benchmark program which populates a primitive vector from the vector package: [snipped; quoted in full above]

Thomas, you and Nick have been very helpful.
I tried to post this in reply to another post, but I got an error, so
I don't know if it went through. If this ends up as a double post,
please forgive me.
Now the challenge is: given a set of objectives, how do you design a
system or program to get the same results as C/C++ in roughly the same
execution time?
If there were a modeling process (like UML for imperative OO languages)
for Haskell, we could compare models and see the differences.
NOTE: I typically use LISP or Racket, so Haskell is not my strong suit.
A good resource is "Everything That Linguists Have Always Wanted To
Know About Logic *But Were Ashamed to Ask" by McCawley. It has such
a good explanation of the lambda calculus that it is worth wading through
the whole book.
Another NOTE: When using symbolic logic I use the Łukasiewicz notation
instead of the Principia notation. This made LISP, Racket and Scheme
really easy for me to think about. Haskell is still a little
different, but it helps there, too.
Mike Burke
Quoting Thomas DuBuisson:
[Thomas's recap and the original message, quoted in full above, snipped]

* Станислав Черничкин:
I wrote a similar program in Scala: [snipped; quoted in full above]
This program runs 20-40 times faster than the Haskell one after warm-up. 20-40 times, Carl! Why is Haskell soo slooooow? How can it be?
It is possible that Hotspot optimizes away most of the code, perhaps even the entire array allocation. Meaningful benchmarking is quite difficult due to such effects. Does the execution time even change if you increase the number of iterations (inner or outer)?
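(A sketch of the same check on the Haskell side, using criterion and the corrected `for_` loop from earlier in the thread; if the writes were being optimized away, the measured times would not grow with the size:)

```haskell
import Data.Foldable (for_)
import qualified Data.Vector.Primitive.Mutable as P
import Criterion.Main (defaultMain, bench, nfIO)

-- Fill an n-element vector with ones.
fillN :: Int -> IO ()
fillN n = do
  v <- P.unsafeNew n
  for_ [0 .. n - 1] $ \i -> P.unsafeWrite v i (1 :: Int)

-- Benchmark several sizes; roughly linear scaling of the
-- mean times indicates the loop is really executing.
main :: IO ()
main = defaultMain
  [ bench ("write" ++ show n) (nfIO (fillN n))
  | n <- [100000, 1000000, 10000000]
  ]
```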

On 23.10.2017 at 23:49, Florian Weimer wrote:
It is possible that Hotspot optimizes away most of the code, perhaps even the entire array allocation. Meaningful benchmarking is quite difficult due to such effects.
Does the execution time even change if you increase the number of iterations (inner or outer)?
I tried that in Java, with this code:

```java
private static final int SIZE = 400_0000;
private static final int ITERATIONS = 20;

public static void main(String[] args) {
    for (int i = 1; i < ITERATIONS; i++) {
        long start = System.currentTimeMillis();
        Long[] a = new Long[SIZE];
        for (int j = 0; j < SIZE; j++) {
            a[j] = 1L;
        }
        long end = System.currentTimeMillis();
        System.out.println((end - start) + " ms");
    }
}
```

Fiddling with SIZE and ITERATIONS led me to the following conclusions:

1) The first two iterations are warm-up; afterwards the times are pretty constant (about 7-8 ms for SIZE=100_000).

2) Cranking ITERATIONS up gave semi-regular slow iterations. I blame GC.

3) Per-loop run time seems to be roughly linear in array size; that's a strong indicator that the assignments are not optimized away. (In fact, a running time of 1-2 ms is a strong indicator that the software is in fact burning some real CPU cycles.)

It is actually possible that the standard HotSpot JVM is faster than Haskell for this type of task - i.e. pure array manipulation with primitive values, with a working set small enough to make GC irrelevant. It's the kind of workload where HotSpot has all the information it needs to fully optimize everything short of eliminating the loop altogether (which is usually not worth it in a just-in-time compiler for a mutable language).

Make things a bit more complicated and real-world, and this advantage will shrink considerably: throw in a few classes, subclasses that override functions (the Haskell equivalent would be unevaluated expressions in the list), iterate over a list instead of an array (which will actually make the Haskell program faster if the list spine can be optimized away), that kind of stuff.

That said, HotSpot is a pretty tough opponent to beat; it has person-decades if not person-centuries of optimization under its belt, after all. Then again, so does GHC, so maybe the comparison isn't so unfair after all - but declaring "Haskell is slow" on the strength of a single microbenchmark simply does not hold any value. Still, it is possible that the JVM is indeed faster for some kinds of workload; if such a workload is actually relevant, it might be a valid optimization target.

Just my 2 cents :-)

Regards,
Jo
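As an aside on the list point: the same fill can be written in Haskell with no explicit indexing at all. A minimal sketch using the pure primitive-vector API (with -O2 this compiles down to a single fill loop):

```haskell
import qualified Data.Vector.Primitive as P

-- One million ones, allocated and filled in one pass;
-- no intermediate list is built.
ones :: P.Vector Int
ones = P.replicate 1000000 1
```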
participants (6):
- Florian Weimer
- Joachim Durchholz
- meburke@rocomai.com
- Nick Smallbone
- Thomas DuBuisson
- Станислав Черничкин