
Thanks for letting me know about the Data.Strict library on Hackage. I will definitely make use of that! BTW, you left out an import Data.List(foldl') in your example. My timing test is an order of magnitude worse than yours. Do you have an extra zero in your list endpoint?
I fed these functions to ghc with the -O2 and -threaded flags and timed them using the list [1..10000000]. The result (best times out of several runs): avg4: 284 ms avgS: 184 ms avgP: 248 ms
Using ghc -threaded -O2 --make Avg.hs for ghc 6.6.1, I ran your tests on [1..10000000] and got the user times: avg4: 12.75 s avgS: 3.65 s avgP: 15.56 s The funny thing is that avg4/avgS = 3.5 for and only 1.5 for you. I understand that with only 1 processor my avgP time may be twice yours, but not the avgS or avg4. I have the following machine: Main memory size: 2026 Mbytes Num Processors: 1 Processor Type: Intel(R) Xeon(TM) CPU 2.80GHz x32 Clock Speed: 2790 MHZ Josef Svenningsson wrote:
Sorry for reacting so late on this mail. I'm digging through some old mails...
On 10/12/07, Dan Weston
wrote: Always check optimizations to make sure they are not pessimizations!
Actually, traversing the list twice is very cheap compared to space leakage, and accumulating pairs requires tuple boxing and unboxing which I don't know how to get GHC not to do.
I agree hole-heartedly that replacing multiple traversals with a single traversal should be done with care as it more often than not results in a pessimization. Indeed you showed just that with your examples! But I'd thought it'd be interesting to see how it can actually be an improvement if done carefully.
\begin{code} import Control.Arrow import qualified Data.Strict.Tuple as T import Data.Strict.Tuple (Pair(..)) import Control.Parallel
avg4 = uncurry (/) . (foldl' (+) 0 &&& foldl' (\x y -> x + 1) 0) avgS = T.uncurry (/) . foldl' (\p n -> ((+n) *!* (+1)) p) (0 :!: 0) avgP = uncurry (/) . (foldl' (+) 0 &!& foldl' (\x y -> x + 1) 0)
(*!*) f g (a :!: b) = f a :!: g b
(&!&) f g a = fa `par` (fa,ga) where fa = f a ga = g a \end{code}
avg4 is the function which was best among Dan's benchmarks. avgS uses strict tuples. I just threw in avgP for fun, it traverses the lists in parallel. Note: I do have a dual core machine so it makes sense to try avgP.
I fed these functions to ghc with the -O2 and -threaded flags and timed them using the list [1..10000000]. The result (best times out of several runs): avg4: 284 ms avgS: 184 ms avgP: 248 ms
It seems doing a single traversal can be faster if your write your function carefully. Doing the traversal in parallel was beneficial but not as good as the single traversal.
Cheers,
/Josef