Performance of Idiomatic lazy Haskell

Markus Böhm

31 Jan 2010 31 Jan '10

10:14 a.m.

As a beginner fascinated about Haskell's lazy, idiomatic style I tried a small example: Calculate a numerical approximation of PI based on the Leibniz formula. Variant 1: My lazy, idiomatic approach (as I understand it so far). It takes on my machine with eps = 0.00000001 approx. 25 sec. import Data.List main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let pi14 = calcpi14 eps putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) calcpi14:: Double->Double calcpi14 eps = foldl' (+) 0 $ takeWhile (\x -> 4 * abs x > eps) ev where e = map (\x->1.0/x) [1.0,3.0..] ev = zipWith (*) e $ cycle [1.0,-1.0] Variant 2: My "C-style" approach achieving state via additonal function parameters. It takes on my machine with eps = 0.00000001 approx. 5 sec. import Data.List main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let pi14 = pisum 1 3 False eps putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum s i _ eps | 4/i < eps = s pisum s i True eps = let h = s + 1/i in h `seq` pisum h (i+2) False eps pisum s i False eps = let h = s - 1/i in h `seq` pisum h (i+2) True eps I'd like to do idiomatic, lazy Haskell w/o such a performance penalty? How can I learn/achive that? -- Markus

Show replies by date

Felipe Lessa

31 Jan 31 Jan

11:28 a.m.

On Sun, Jan 31, 2010 at 11:14:36AM +0100, Markus Böhm wrote:

...

Variant 1: My lazy, idiomatic approach (as I understand it so far). It takes on my machine with eps = 0.00000001 approx. 25 sec.

Variant 2: My "C-style" approach achieving state via additonal function parameters. It takes on my machine with eps = 0.00000001 approx. 5 sec.

You could try the stream-fusion package on Hackage: http://hackage.haskell.org/package/stream-fusion It should give better performance on this kind of code since it can see the loops through the lists. Don't forget to compile with optimizations! :) HTH, -- Felipe.

Stephen Tetley

11:37 a.m.

Hi Markus I haven't tested the efficiency, but an unfold seems natural / idiomatic in this case. The Leibniz formula generates a list in the first step then sums it. Generating a list is often a perfect fit for an unfold. The code below seems quite close to the idea of the Leibniz formula (add 2 to the denominator, flip the sign at each step), sum and multiple by 4.

...

import Data.List (unfoldr)

...

leibniz n = (4 *) $ sum $ take n step

-- Note this unfoldr generates an infinite list (both cases produce Just values) ...

...

step :: [Double] step = unfoldr phi (True,1) where phi (sig,d) | sig == True = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2))

Alternatively, an unfoldr with a stop condition might seem more natural, creating a bounded list. However, here it is a bit ugly as there are now 3 elements in the unfold state:

...

leibniz' n = (4 *) $ sum $ step' n

...

step' :: Int -> [Double] step' n = unfoldr phi (0,True,1) where phi (i,_,_) | i == n = Nothing phi (i,sig,d) | sig == True = Just (1/d, (i+1,False,d+2)) | otherwise = Just (negate (1/d), (i+1,True,d+2))

Best wishes Stephen

Stephen Tetley

12:23 p.m.

Hi Markus Whoops, I hadn't read your email properly and wasn't accounting for the epsilon. Here's a version that does, although it is perhaps a little slow...

...

import Data.List (unfoldr)

...

leibniz eps = converge eps ser

...

ser :: [Double] ser = unfoldr phi (True,1) where phi (sig,d) | sig == True = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2))

...

converge :: Double -> [Double] -> Double converge eps xs = step 0 0 xs where step a b (x:xs) = let a' = a + (4*x) in if abs (a'-b) < eps then a' else step a' a xs

...

demo1 = leibniz 0.00000000025

Best wishes Stephen

Daniel Fischer

1:39 p.m.

Am Sonntag 31 Januar 2010 13:23:33 schrieb Stephen Tetley:

...

Hi Markus

Whoops, I hadn't read your email properly and wasn't accounting for the epsilon. Here's a version that does, although it is perhaps a little slow...

Better use the previous and calculate how many terms you need: ============================================ module Main (main) where import Data.List (unfoldr) main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) $ sum $ take n step step :: [Double] step = unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) giving $ echo '0.00000001' | ./unfoldPi +RTS -sstderr -RTS ./unfoldPi +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.141592648589476 27,305,969,616 bytes allocated in the heap 2,523,788 bytes copied during GC 61,660 bytes maximum residency (1 sample(s)) 38,864 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 52083 collections, 0 parallel, 0.29s, 0.44s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 23.10s ( 23.14s elapsed) GC time 0.29s ( 0.44s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 23.39s ( 23.57s elapsed) %GC time 1.2% (1.9% elapsed) Alloc rate 1,182,002,769 bytes per MUT second Productivity 98.8% of total user, 98.0% of total elapsed =============================================== a little better if we do the unfolding ourselves, not creating any intermediate pairs: leibniz n = (4*) . sum $ unf n True 1 unf :: Int -> Bool -> Double -> [Double] unf 0 _ _ = [] unf k True n = 1/n : unf (k-1) False (n+2) unf k False n = negate (1/n) : unf (k-1) True (n+2) which gives $ echo '0.00000001' | ./unfPi +RTS -sstderr -RTS ./unfPi +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.141592648589476 15,250,801,168 bytes allocated in the heap 1,284,632 bytes copied during GC 61,592 bytes maximum residency (1 sample(s)) 38,864 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 29089 collections, 0 parallel, 0.22s, 0.32s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 18.05s ( 18.05s elapsed) GC time 0.22s ( 0.32s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 18.28s ( 18.37s elapsed) %GC time 1.2% (1.7% elapsed) Alloc rate 844,773,242 bytes per MUT second Productivity 98.8% of total user, 98.2% of total elapsed ============================================= Still not competitive with the loop version $ echo '0.00000001' | ./loopPi +RTS -sstderr -RTS ./loopPi +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926526069526 136,248 bytes allocated in the heap 2,024 bytes copied during GC 51,720 bytes maximum residency (1 sample(s)) 38,392 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 6.64s ( 6.64s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 6.64s ( 6.64s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 20,517 bytes per MUT second Productivity 100.0% of total user, 100.0% of total elapsed which can also be improved: {-# LANGUAGE BangPatterns #-} module Main (main) where main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let mx = floor (4/eps) k = (mx-1) `quot` 2 !pi14 = pisum (even k) (fromInteger (2*k+1)) putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) -- sum from small numbers to large, to reduce cancellation -- although, in this particular case, for eps = 1e-8, the result is -- farther off than summing the other way pisum :: Bool -> Double -> Double pisum bl start = go bl start 0 where go _ n s | n < 1 = s go True n !s = go False (n-2) (s+recip n) go False n !s = go True (n-2) (s-recip n) $ echo '0.00000001' | ./mloopPi +RTS -sstderr -RTS ./mloopPi +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.141592648589793 136,416 bytes allocated in the heap 2,024 bytes copied during GC 51,720 bytes maximum residency (1 sample(s)) 38,392 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 4.55s ( 4.56s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.55s ( 4.56s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 29,966 bytes per MUT second Productivity 99.9% of total user, 99.8% of total elapsed Most of the gain is due to the much simpler loop-break test, a small bit may be due to bang-patterns vs. seq. Bottom line: GHC isn't very good at fusing away intermediate lists in strict algorithms (it's better at it for lazy algorithms). Allocating so many cons-cells just to be immediately garbage-collected costs a lot of time. You can code the loop directly, which gives reasonable results, or, as Felipe suggested, use the fusion framework created by experts: module Main (main) where import qualified Data.List.Stream as S main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) $ S.sum $ S.take n step step :: [Double] step = S.unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) which beats the hand-coded loop: $ echo '0.00000001' | ./sunfPi +RTS -sstderr -RTS ./sunfPi +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926445727678 136,560 bytes allocated in the heap 2,024 bytes copied during GC 51,720 bytes maximum residency (1 sample(s)) 38,392 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 4.27s ( 4.27s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.27s ( 4.27s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 31,994 bytes per MUT second Productivity 100.0% of total user, 100.0% of total elapsed

...

...
import Data.List (unfoldr)

leibniz eps = converge eps ser

ser :: [Double] ser = unfoldr phi (True,1) where phi (sig,d) | sig == True = Just (1/d, (False,d+2))

| otherwise = Just (negate (1/d), (True,d+2))

converge :: Double -> [Double] -> Double converge eps xs = step 0 0 xs where step a b (x:xs) = let a' = a + (4*x) in if abs (a'-b) < eps then a' else step a' a xs

demo1 = leibniz 0.00000000025

Best wishes

Stephen

Stephen Tetley

2:52 p.m.

Hi Daniel Thanks - the figures are very impressive for the stream fusion library. I knew the paper, but I hadn't looked at it the implementation. Making a stricter unfoldr by using a strictness annotation on the state and getting rid of the tuple is nowhere near stream fusion-lib, i.e: data Maybe2 a st = Nothing2 | Just2 a !st deriving (Eq,Show) Performance wise stream fusion even beats a monoidal unfoldr. A monoidal unfoldr seems reasonable to me for this problem as there is no need to generate a list. As a monoidal unfoldr is not in the 'standard' libraries some people might not consider it idiomatic though. I put in a fixed value for epsilon in all three version rather than used echo at the command line and compiled all with ghc --make -O2 -------------- module Main (main) where import Data.Monoid data Maybe2 a st = Nothing2 | Just2 a !st deriving (Eq,Show) dummy_eps :: Double dummy_eps = 0.00000001 main :: IO () main = do putStrLn "EPS: " eps <- return dummy_eps let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) $ getSum $ step n step :: Integer -> Sum Double step times = unfoldrMon phi (0,True,1) where phi (i,_,_) | i == times = Nothing2 phi (i,sig,d) | sig = Just2 (Sum (1/d)) (i+1,False,d+2) | otherwise = Just2 (Sum (negate (1/d))) (i+1,True,d+2) unfoldrMon :: Monoid a => (b -> Maybe2 a b) -> b -> a unfoldrMon f b = case f b of Just2 a new_b -> a `mappend` unfoldrMon f new_b Nothing2 -> mempty -- --------------------------------------------------- Leibniz1 (stream fusion) --------------------------------------------------- $ ./Leibniz1 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926445727678 24,404 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.03s ( 0.00s elapsed) MUT time 4.59s ( 4.59s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.63s ( 4.59s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 5,276 bytes per MUT second Productivity 99.3% of total user, 100.0% of total elapsed --------------------------------------------------- Leibniz3 (monoidal unfoldr) --------------------------------------------------- $ ./Leibniz3 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr EPS: Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it. 92,448,984 bytes allocated in the heap 15,368 bytes copied during GC 8,382,800 bytes maximum residency (5 sample(s)) 4,198,684 bytes maximum slop 17 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 125 collections, 0 parallel, 1.66s, 1.66s elapsed Generation 1: 5 collections, 0 parallel, 0.03s, 0.03s elapsed INIT time 0.03s ( 0.00s elapsed) MUT time 3.50s ( 3.55s elapsed) GC time 1.69s ( 1.69s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 5.22s ( 5.23s elapsed) %GC time 32.3% (32.2% elapsed) Alloc rate 26,180,243 bytes per MUT second Productivity 67.1% of total user, 66.9% of total elapsed --------------------------------------------------- Leibniz2 - no stream fusion, unfoldr with strictness annotation on the state --------------------------------------------------- $ ./Leibniz2 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz2.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.141592648589476 25,600,024,064 bytes allocated in the heap 2,152,224 bytes copied during GC 3,336 bytes maximum residency (1 sample(s)) 11,908 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 48828 collections, 0 parallel, 2.19s, 2.23s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.02s ( 0.00s elapsed) MUT time 75.16s ( 76.22s elapsed) GC time 2.19s ( 2.23s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 77.36s ( 78.45s elapsed) %GC time 2.8% (2.8% elapsed) Alloc rate 340,553,219 bytes per MUT second Productivity 97.2% of total user, 95.8% of total elapsed

Stephen Tetley

3:10 p.m.

Arghh, an error in that post that I didn't notice, I send a revision soon... Apologies Stephen

...

--------------------------------------------------- Leibniz3 (monoidal unfoldr) ---------------------------------------------------

$ ./Leibniz3 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr EPS: Stack space overflow: current size 8388608 bytes.

Stephen Tetley

6:21 p.m.

Hello Daniel I can get close with a tail recursive + accumulator monoidal unfoldr and strict triple for the state (leibniz4). But it does seem to be loosing the point of the initial exercise "to be idiomatic", also the answers are divergent too... Best wishes Stephen ------ Leibniz4 - monodial unfoldr strict triple ------ $ ./Leibniz4 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz4.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926526069526 24,424 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.03s ( 0.00s elapsed) MUT time 4.34s ( 4.45s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.38s ( 4.45s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 5,582 bytes per MUT second Productivity 99.3% of total user, 97.5% of total elapsed ------ Leibniz1 - stream fusion ------ $ ./Leibniz1 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926445727678 24,404 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.02s ( 0.00s elapsed) MUT time 4.61s ( 4.63s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.63s ( 4.63s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 5,276 bytes per MUT second Productivity 99.7% of total user, 99.7% of total elapsed ---------- module Main (main) where import Data.Monoid data Maybe2 a st = Nothing2 | Just2 a !st deriving (Eq,Show) dummy_eps :: Double dummy_eps = 0.00000001 main :: IO () main = do putStrLn "EPS: " eps <- return dummy_eps let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz :: Integer -> Double leibniz n = (4 *) $ getSum $ step (fromIntegral n) data Trip = T3 !Int !Bool !Double step :: Int -> Sum Double step times = unfoldrMon phi (T3 0 True 1) where phi :: Trip -> Maybe2 (Sum Double) Trip phi (T3 i _ _) | i == times = Nothing2 phi (T3 i sig d) | sig = Just2 (Sum (1/d)) (T3 (i+1) False (d+2)) | otherwise = Just2 (Sum (negate (1/d))) (T3 (i+1) True (d+2)) unfoldrMon :: Monoid a => (b -> Maybe2 a b) -> b -> a unfoldrMon f b0 = step b0 mempty where step b acc = case f b of Nothing2 -> acc Just2 a new_b -> step new_b (a `mappend` acc)

Daniel Fischer

6:50 p.m.

Am Sonntag 31 Januar 2010 19:21:10 schrieb Stephen Tetley:

...

Hello Daniel

I can get close with a tail recursive + accumulator monoidal unfoldr and strict triple for the state (leibniz4).

Beats stream fusion and direct loops comfortably here (3.55s) when compiled with the NCG, very narrowly when compiled via C (2.99s). Congrats.

...

But it does seem to be loosing the point of the initial exercise "to be idiomatic",

Yes. Returning to that, a) for tasks where it's appropriate/sufficiently natural, hand-coded loops are idiomatic, too. b) use stream-fusion to get rid of intermediate structures.

...

also the answers are divergent too...

In how far?

...

Best wishes

Stephen

Stephen Tetley

7:22 p.m.

On 31 January 2010 18:50, Daniel Fischer wrote:

...

Yes. Returning to that,

a) for tasks where it's appropriate/sufficiently natural, hand-coded loops are idiomatic, too.

Indeed - originally I looked at Markus's code and only saw that an unfoldr would be a natural implementation for the Leibniz generating function. I overlooked the episilon used to control the accuracy which pollutes the unfold and makes it no clearer (too much haste / speed on my part).

...

...
also the answers are divergent too...

In how far?

On my machine, they diverged at the eighth decimal place - Leibniz4 3.1415926526069526 Leibniz1 3.1415926445727678 Best wishes Stephen

Daniel Fischer

8:43 p.m.

Am Sonntag 31 Januar 2010 20:22:56 schrieb Stephen Tetley:

...

On my machine, they diverged at the eighth decimal place -

Leibniz4 3.1415926526069526 Leibniz1 3.1415926445727678

Ah, that. I thought you were talking about timings. There are several things that can lead to differing results here, 1) order of summation (I think only my loop had a different order) 2) due to different tests, it might happen that one algorithm evaluates one term more than the other 3) are intermediate results truncated to 64 bits or kept in an 80-bit register? I'm astonished by the size of the difference, though. But if you sum from small to large, the results are very accurately the theoretically expected results, so I think it's 1) together with 3) which make the difference.

Markus Böhm

1 Feb 1 Feb

9:37 a.m.

Daniel, Stephen, Felipe thanks for Your answers/advice, which I studied this morning. Helps me a lot on my learning curve. I guess it takes some time to get a feeling for the performance/runtime behaviour of abstractions in Haskell. Just for fun and w/o knowing Lua: I translated my Haskell "loops-while"-version into Lua and ran on LuaJit Beta 2.0.0. Seems to beat our best Haskell version. print("EPS: ") EPS = io.read("*number") pi2 = 1 i = 3 s = false repeat pi1 = pi2 if s then pi2 = pi1 + 1.0/i s = false else pi2 = pi1 - 1.0/i s = true end i = i+2 until EPS > 4.0 * math.abs(pi2-pi1) print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2) -- Markus On Sun, Jan 31, 2010 at 9:43 PM, Daniel Fischer wrote:

...

Am Sonntag 31 Januar 2010 20:22:56 schrieb Stephen Tetley:

...
On my machine, they diverged at the eighth decimal place -

Leibniz4 3.1415926526069526 Leibniz1 3.1415926445727678

Ah, that. I thought you were talking about timings.

There are several things that can lead to differing results here,

1) order of summation (I think only my loop had a different order) 2) due to different tests, it might happen that one algorithm evaluates one term more than the other 3) are intermediate results truncated to 64 bits or kept in an 80-bit register?

I'm astonished by the size of the difference, though.

But if you sum from small to large, the results are very accurately the theoretically expected results, so I think it's 1) together with 3) which make the difference. _______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Daniel Fischer

1:16 p.m.

Am Montag 01 Februar 2010 10:37:41 schrieb Markus Böhm:

...

Daniel, Stephen, Felipe thanks for Your answers/advice, which I studied this morning. Helps me a lot on my learning curve. I guess it takes some time to get a feeling for the performance/runtime behaviour of abstractions in Haskell.

Just for fun and w/o knowing Lua: I translated my Haskell "loops-while"-version into Lua

That's not quite a translation of the loop you posted. Not really important, though, small difference.

...

and ran on LuaJit Beta 2.0.0. Seems to beat our best Haskell version.

So LuaJit Beta 2.0.0 beats gcc -O3 on such a simple loop? I'm not going to install Lua to test, but if that's really the case, I'm sure the folks at gnu.org would like to hear about it. Or don't you consider -fvia-C compiled loops to be true Haskell versions?

...

print("EPS: ") EPS = io.read("*number") pi2 = 1 i = 3 s = false repeat pi1 = pi2 if s then pi2 = pi1 + 1.0/i s = false else pi2 = pi1 - 1.0/i s = true end i = i+2

until EPS > 4.0 * math.abs(pi2-pi1)

print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)

-- Markus

Markus Böhm

1:31 p.m.

Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with attached variant. I compiled all our Haskell variants with ghc --make -O2. I don't know about -fvia-C, have to find out. I hope I didn't distract You with my Lua variant. In any case thank You very much for your advice. Markus. print("EPS: ") EPS = io.read("*number") local x = os.clock() pi2 = 1 i = 3 s = false repeat pi1 = pi2 if s then pi2 = pi1 + 1.0/i s = false else pi2 = pi1 - 1.0/i s = true end i = i+2 until EPS > 4.0 * math.abs(pi2-pi1) print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2) print(string.format("elapsed time: %.2f\n", os.clock() -x)) -- Markus On Mon, Feb 1, 2010 at 2:16 PM, Daniel Fischer wrote:

...

Am Montag 01 Februar 2010 10:37:41 schrieb Markus Böhm:

...
Daniel, Stephen, Felipe thanks for Your answers/advice, which I studied this morning. Helps me a lot on my learning curve. I guess it takes some time to get a feeling for the performance/runtime behaviour of abstractions in Haskell.

Just for fun and w/o knowing Lua: I translated my Haskell "loops-while"-version into Lua

That's not quite a translation of the loop you posted. Not really important, though, small difference.

...
and ran on LuaJit Beta 2.0.0. Seems to beat our best Haskell version.

So LuaJit Beta 2.0.0 beats gcc -O3 on such a simple loop? I'm not going to install Lua to test, but if that's really the case, I'm sure the folks at gnu.org would like to hear about it. Or don't you consider -fvia-C compiled loops to be true Haskell versions?

...
print("EPS: ") EPS = io.read("*number") pi2 = 1 i = 3 s = false repeat pi1 = pi2 if s then pi2 = pi1 + 1.0/i s = false else pi2 = pi1 - 1.0/i s = true end i = i+2

until EPS > 4.0 * math.abs(pi2-pi1)

print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)

-- Markus

Daniel Fischer

1:52 p.m.

Am Montag 01 Februar 2010 14:31:40 schrieb Markus Böhm:

...

Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with attached variant.

That's pretty fast.

...

I compiled all our Haskell variants with ghc --make -O2.

Well, GHC isn't as good at loop-optimising as gcc is. Depending on the loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than the NCG compiled.

...

I don't know about -fvia-C, have to find out. I hope I didn't distract You with my Lua variant.

No sweat.

...

In any case thank You very much for your advice. Markus.

You're welcome. Can you try the below with ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs and run with echo '0.00000001' | time ./luaLoop ? It's a fairly direct translation of the Lua code, and it runs here more or less equally fast as (gcc compiled) C-loops. ================================= module Main (main) where main :: IO () main = do putStrLn "EPS:" eps <- readLn :: IO Double print $ 4*calcPi eps calcPi :: Double -> Double calcPi eps = go False 1 3 where go bl p1 i | 4*abs(p2-p1) < eps = p1 | otherwise = go (not bl) p2 (i+2) where p2 | bl = p1+1/i | otherwise = p1-1/i ==================================

Markus Böhm

2:08 p.m.

Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time. -- Markus On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer wrote:

...

Am Montag 01 Februar 2010 14:31:40 schrieb Markus Böhm:

...
Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with attached variant.

That's pretty fast.

...
I compiled all our Haskell variants with ghc --make -O2.

Well, GHC isn't as good at loop-optimising as gcc is. Depending on the loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than the NCG compiled.

...
I don't know about -fvia-C, have to find out. I hope I didn't distract You with my Lua variant.

No sweat.

...
In any case thank You very much for your advice. Markus.

You're welcome.

Can you try the below with

ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs

and run with

echo '0.00000001' | time ./luaLoop

? It's a fairly direct translation of the Lua code, and it runs here more or less equally fast as (gcc compiled) C-loops.

================================= module Main (main) where

main :: IO () main = do putStrLn "EPS:" eps <- readLn :: IO Double print $ 4*calcPi eps

calcPi :: Double -> Double calcPi eps = go False 1 3 where go bl p1 i | 4*abs(p2-p1) < eps = p1 | otherwise = go (not bl) p2 (i+2) where p2 | bl = p1+1/i | otherwise = p1-1/i

==================================

Markus Böhm

2:22 p.m.

Daniel, here another feedback: F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ... F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein: F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 7,604 bytes per MUT second Productivity 99.8% of total user, 47.9% of total elapsed -- Markus On Mon, Feb 1, 2010 at 3:08 PM, Markus Böhm wrote:

...

Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

-- Markus

On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer wrote:

...
Am Montag 01 Februar 2010 14:31:40 schrieb Markus Böhm:

...
Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with attached variant.

That's pretty fast.

...
I compiled all our Haskell variants with ghc --make -O2.

Well, GHC isn't as good at loop-optimising as gcc is. Depending on the loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than the NCG compiled.

...
I don't know about -fvia-C, have to find out. I hope I didn't distract You with my Lua variant.

No sweat.

...
In any case thank You very much for your advice. Markus.

You're welcome.

Can you try the below with

ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs

and run with

echo '0.00000001' | time ./luaLoop

? It's a fairly direct translation of the Lua code, and it runs here more or less equally fast as (gcc compiled) C-loops.

================================= module Main (main) where

main :: IO () main = do putStrLn "EPS:" eps <- readLn :: IO Double print $ 4*calcPi eps

calcPi :: Double -> Double calcPi eps = go False 1 3 where go bl p1 i | 4*abs(p2-p1) < eps = p1 | otherwise = go (not bl) p2 (i+2) where p2 | bl = p1+1/i | otherwise = p1-1/i

==================================

Daniel Fischer

2:40 p.m.

Am Montag 01 Februar 2010 15:22:45 schrieb Markus Böhm:

...

Daniel, I use GHC 6.12.1 and Windows XP.

*sigh* Have you a CygWin installed? It should run as given from a CygWin shell, I believe.

...

The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below.

...

Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 7,604 bytes per MUT second

Productivity 99.8% of total user, 47.9% of total elapsed

What????? Please try the following: 1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as a command line argument) 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- make Source.hs 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs 4) cloop +RTS -sstderr 5) nloop +RTS -sstderr

...

-- Markus

Markus Böhm

2:59 p.m.

1. I used Your lualoop file with content: module Main (main) where main :: IO () main = do --putStrLn "EPS:" -- eps <- readLn :: IO Double print $ 4*calcPi 0.00000001 calcPi :: Double -> Double calcPi eps = go False 1 3 where go bl p1 i | 4*abs(p2-p1) < eps = p1 | otherwise = go (not bl) p2 (i+2) where p2 | bl = p1+1/i | otherwise = p1-1/i 2. F:\MeineUebungen>ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o c loop --make p085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking cloop.exe ... F:\MeineUebungen>ghc -O2 -fforce-recomp -o nloop --make p085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking nloop.exe ... F:\MeineUebungen>cloop +RTS -sstderr cloop +RTS -sstderr 3.1415926485894725 20,860 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.02s ( 0.02s elapsed) MUT time 7.39s ( 7.27s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.41s ( 7.28s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 2,816 bytes per MUT second Productivity 99.8% of total user, 101.5% of total elapsed F:\MeineUebungen>nloop +RTS -sstderr nloop +RTS -sstderr 3.1415926485894725 20,860 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.02s ( 0.00s elapsed) MUT time 4.77s ( 4.81s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.78s ( 4.81s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 4,362 bytes per MUT second Productivity 99.7% of total user, 99.0% of total elapsed -- Markus On Mon, Feb 1, 2010 at 3:40 PM, Daniel Fischer wrote:

...

Am Montag 01 Februar 2010 15:22:45 schrieb Markus Böhm:

...
Daniel, I use GHC 6.12.1 and Windows XP.

*sigh*

Have you a CygWin installed? It should run as given from a CygWin shell, I believe.

...
The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below.

...
Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 7,604 bytes per MUT second

Productivity 99.8% of total user, 47.9% of total elapsed

What?????

Please try the following:

1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as a command line argument) 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- make Source.hs 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs 4) cloop +RTS -sstderr 5) nloop +RTS -sstderr

...
-- Markus

Daniel Fischer

4:10 p.m.

Am Montag 01 Februar 2010 15:59:59 schrieb Markus Böhm:

...

1. I used Your lualoop file with content: module Main (main) where

main :: IO () main = do --putStrLn "EPS:" -- eps <- readLn :: IO Double print $ 4*calcPi 0.00000001

calcPi :: Double -> Double calcPi eps = go False 1 3 where go bl p1 i

| 4*abs(p2-p1) < eps = p1 | otherwise = go (not bl) p2 (i+2)

where p2 | bl = p1+1/i

| otherwise = p1-1/i

Oops, I've screwed up my timings earlier, that variant doesn't quite give optimal speed (3.65s via-C vs.3.02s for the other loops, but the NCG code of that takes a whopping 12.13s here vs. 4.x - 6.y s for the other loops), what I had measured was ==================================== calcPi :: Double -> Double calcPi eps = go False 1 3 where go True p1 i | 4*(p2-p1) < eps = p1 | otherwise = go False p2 (i+2) where p2 = p1+1/i go False p1 i | 4*(p1-p2) < eps = p1 | otherwise = go True p2 (i+2) where p2 = p1-1/i ==================================== (which gives 3.03s via C and 6.91s with the NCG). Mind trying that, too? Nevertheless, the results are seriously disturbing. The previous code runs more than 2.5 times as fast on your computer than on mine when compiled with the NCG and twice as fast on my computer than on yours when compiled via C. I don't know what to make of it.

...

2. F:\MeineUebungen>ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o c loop --make p085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking cloop.exe ...

F:\MeineUebungen>ghc -O2 -fforce-recomp -o nloop --make p085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking nloop.exe ...

F:\MeineUebungen>cloop +RTS -sstderr cloop +RTS -sstderr 3.1415926485894725 20,860 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.02s elapsed) MUT time 7.39s ( 7.27s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.41s ( 7.28s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 2,816 bytes per MUT second

Productivity 99.8% of total user, 101.5% of total elapsed

F:\MeineUebungen>nloop +RTS -sstderr nloop +RTS -sstderr 3.1415926485894725 20,860 bytes allocated in the heap 892 bytes copied during GC 3,068 bytes maximum residency (1 sample(s)) 13,316 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 4.77s ( 4.81s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.78s ( 4.81s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 4,362 bytes per MUT second

Productivity 99.7% of total user, 99.0% of total elapsed

-- Markus

Tim Perry

7:15 p.m.

CygWin would be a good option. Here are some other ideas: There is a timeit.exe utility in the Windows Server downloads which is supposed to be similar to unix's time command. Try the links here http://channel9.msdn.com/forums/Coffeehouse/258979-Windows-equivalent-of-Uni... Windows does support piping. It is similar to bash with 2>&1 to redirect stderr to stdout and stuff. http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-... The new windows shell is almost a real shell. Almost. If you don't have that, then this port of unix utilities makes an XP box tolerable. http://unxutils.sourceforge.net/ Hope that helps. --Tim ----- Original Message ---- From: Daniel Fischer To: Markus Böhm Cc: beginners@haskell.org Sent: Mon, February 1, 2010 6:40:52 AM Subject: Re: [Haskell-beginners] Performance of Idiomatic lazy Haskell Am Montag 01 Februar 2010 15:22:45 schrieb Markus Böhm:

...

Daniel, I use GHC 6.12.1 and Windows XP.

*sigh* Have you a CygWin installed? It should run as given from a CygWin shell, I believe.

...

The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below.

...

Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 7,604 bytes per MUT second

Productivity 99.8% of total user, 47.9% of total elapsed

...

-- Markus

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Henk-Jan van Tuyl

11:05 p.m.

unxutils.zip seems to be gone; another option would be, to use MinGW/MSYS. You need this anyway for installing several Haskell packages that bind to C-software. If you start an MS-DOS shell and give command "sh", you are in a unixlike shell, where you can give the "time" command. Regards, Henk-Jan van Tuyl -- http://Van.Tuyl.eu/ http://members.chello.nl/hjgtuyl/tourdemonad.html -- On Mon, 01 Feb 2010 20:15:25 +0100, Tim Perry wrote:

...

CygWin would be a good option.

Here are some other ideas:

There is a timeit.exe utility in the Windows Server downloads which is supposed to be similar to unix's time command. Try the links here http://channel9.msdn.com/forums/Coffeehouse/258979-Windows-equivalent-of-Uni...

Windows does support piping. It is similar to bash with 2>&1 to redirect stderr to stdout and stuff. http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-...

The new windows shell is almost a real shell. Almost.

If you don't have that, then this port of unix utilities makes an XP box tolerable. http://unxutils.sourceforge.net/

Hope that helps. --Tim

----- Original Message ---- From: Daniel Fischer To: Markus Böhm Cc: beginners@haskell.org Sent: Mon, February 1, 2010 6:40:52 AM Subject: Re: [Haskell-beginners] Performance of Idiomatic lazy Haskell

Am Montag 01 Februar 2010 15:22:45 schrieb Markus Böhm:

...
Daniel, I use GHC 6.12.1 and Windows XP.

*sigh*

Have you a CygWin installed? It should run as given from a CygWin shell, I believe.

...
The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below.

...
Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 7,604 bytes per MUT second

Productivity 99.8% of total user, 47.9% of total elapsed

What?????

Please try the following:

1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as a command line argument) 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- make Source.hs 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs 4) cloop +RTS -sstderr 5) nloop +RTS -sstderr

...
-- Markus

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Tim Perry

2 Feb 2 Feb

5:34 p.m.

The links are confusing, but unxutils is still available: http://sourceforge.net/projects/unxutils/ MinGW/MSYS may be a better and/or more fully supported package but I have no experience there. Anyhow, time is built into the unxutils "sh" shell: C:\Documents and Settings\perry>sh WREN# time sleep 3 sleep 3 0.00s user 0.00s system 0% cpu 3.125 total WREN# However, I'm pretty sure the shell is actually zsh, not sh. ----- Original Message ---- From: Henk-Jan van Tuyl To: Tim Perry ; Daniel Fischer ; Markus Böhm Cc: beginners@haskell.org Sent: Mon, February 1, 2010 3:05:58 PM Subject: Re: [Haskell-beginners] Performance of Idiomatic lazy Haskell unxutils.zip seems to be gone; another option would be, to use MinGW/MSYS. You need this anyway for installing several Haskell packages that bind to C-software. If you start an MS-DOS shell and give command "sh", you are in a unixlike shell, where you can give the "time" command. Regards, Henk-Jan van Tuyl -- http://Van.Tuyl.eu/ http://members.chello.nl/hjgtuyl/tourdemonad.html -- On Mon, 01 Feb 2010 20:15:25 +0100, Tim Perry wrote:

...

CygWin would be a good option.

Here are some other ideas:

There is a timeit.exe utility in the Windows Server downloads which is supposed to be similar to unix's time command. Try the links here http://channel9.msdn.com/forums/Coffeehouse/258979-Windows-equivalent-of-Uni...

Windows does support piping. It is similar to bash with 2>&1 to redirect stderr to stdout and stuff. http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-...

The new windows shell is almost a real shell. Almost.

If you don't have that, then this port of unix utilities makes an XP box tolerable. http://unxutils.sourceforge.net/

Hope that helps. --Tim

----- Original Message ---- From: Daniel Fischer To: Markus Böhm Cc: beginners@haskell.org Sent: Mon, February 1, 2010 6:40:52 AM Subject: Re: [Haskell-beginners] Performance of Idiomatic lazy Haskell

Am Montag 01 Februar 2010 15:22:45 schrieb Markus Böhm:

...
Daniel, I use GHC 6.12.1 and Windows XP.

*sigh*

Have you a CygWin installed? It should run as given from a CygWin shell, I believe.

...
The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time.

Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below.

...
Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht übernommen werden. Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725 61,428 bytes allocated in the heap 1,316 bytes copied during GC 4,564 bytes maximum residency (1 sample(s)) 11,820 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.02s ( 0.00s elapsed) MUT time 8.06s ( 16.83s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 8.08s ( 16.83s elapsed)

%GC time 0.0% (0.0% elapsed)

Alloc rate 7,604 bytes per MUT second

Productivity 99.8% of total user, 47.9% of total elapsed

What?????

Please try the following:

1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as a command line argument) 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- make Source.hs 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs 4) cloop +RTS -sstderr 5) nloop +RTS -sstderr

...
-- Markus

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Daniel Fischer

31 Jan 31 Jan

6:10 p.m.

Am Sonntag 31 Januar 2010 15:52:41 schrieb Stephen Tetley:

...

Hi Daniel

Thanks - the figures are very impressive for the stream fusion library. I knew the paper, but I hadn't looked at it the implementation.

Just for the record, the loops (my variant) coded in C take 2.7 seconds (icc) resp. 3.0 seconds (gcc) with -O3. Markus' loop takes 5.32s (icc) resp. 6.04s (gcc) with -O3. Well, that's largely due to the fact that that takes two divisions per iteration (and divisions are very slow on my box). Rewriting it so that only one division per iteration is necessary: ================================================ module Main (main) where main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let pi14 = pisum 1 3 False (eps*0.25) putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum s i b eps | d < eps = s | b = let h = s+d in h `seq` pisum h (i+2) False eps | otherwise = let h = s-d in h `seq` pisum h (i+2) True eps where d = 1/i ================================================ it takes 5.42 seconds (eps = 1e-8) when compiled natively, that loop in C is practically indistinguishable from my loop variant. So stream-fusion or hand-coded loops in Haskell are clearly slower than C loops, but don't stink when compiled with the native code generator. However, I get identical to gcc performance with ghc -O2 -fexcess-precision -fvia-C -optc-O3 [-optc-ffast-math doesn't make any difference then] for the hand-coded loops (all, my variant, Markus' original, Markus' rewritten - of course, each identical to the corresponding C-loop). Also for stream-fusion. Some of the list creating variants also profit from compilation via C, others not. However, none come near the NCG loop performance, let alone C loop performance. To sum up: Although GHC's native code generator does a not too bad job with arithmetic-heavy loops, gcc is better (even without optimisations). But if you compile via C, you can get identical performance - for these loops. In general, things are more complicated, often the native code generator produces better code than the via-C route. It's usually worth a test which produces the better binary, native or via-C.

5681

Age (days ago)

5683

Last active (days ago)

List overview

Download

23 comments

6 participants

participants (6)

Daniel Fischer
Felipe Lessa
Henk-Jan van Tuyl
Markus Böhm
Stephen Tetley
Tim Perry