
Hello Simon, Wednesday, January 18, 2006, 5:31:25 PM, you wrote:
2) generating random values takes about 1.5-2 seconds by itself. Haskell's RNG is very different from C's one
SM> I squeezed a bit more out (see attached).
x `seq` v `seq` return ()
it's new trick for me :) now the difference is less than 3x btw, i also use "return $! length xs" trick to ensure that all xs elements will be evaluated
for from to action | from `seq` to `seq` False = undefined
and this changes nothing, at least with 6.4.1/mingw32 btw, using "mapM_ action [n..m]" is very common operation. can it be automatically substituted with my code by using some RULE pragmas in ghc libraries? that will automatically improve many ghc-compiled programs too, i use the following code instead of replicateM: myReplicateM n action = if (n<=5*10^4) then sequence (replicate n action) else goLarge n [] >>= return.reverse where goLarge 0 xs = return xs goLarge n xs = do x <- action (goLarge $! n-1) $! x:xs it doesn't overflow stack and works much faster for the large n. that is my testbed for this function: import Control.Monad main = do a <- replicateM 1 $ myReplicateM (1*10^6) (return 1) return $! sum (map last a) and also, how about adding to GHC strictness annotations? x <- newArray (0,nelems-1) 0 :: IO !Vector v <- newArray_ (0,nelems-1) :: IO !Vector for :: !Int -> !Int -> (!Int -> IO a) -> IO () it's SO common source of performance problems... SM> I think the main bottleneck SM> is now the random number generator, in particular it is supplying boxed SM> Doubles which have to be unboxed again before storing in the array. as i say, it uses 1.5-2 seconds, i.e. only 10% of time when you run 1000 iterations (may be you not noticed that it used only in initialization?). so, while RNG itself runs 150 times slower (!), it doesn't make so much difference when you run 1000 iterations after initial filling the array and about "using Altivec instructions". the code produced for new.hs contains only one `fadd` operation, so it is easy to find entire cycle as it is compiled by GHC. that is one: movl (%ebp), %eax cmpl 12(%esi), %eax jge L81 movl 8(%esi), %edx leal 8(%edx,%eax,8), %eax movl (%eax), %edx movl %edx, 16(%esp) movl 4(%eax), %eax movl %eax, 20(%esp) fldl 16(%esp) fstpl 24(%esp) fldl 24(%esp) fstpl 48(%esp) movl (%ebp), %eax movl 4(%esi), %edx leal 8(%edx,%eax,8), %eax movl (%eax), %edx movl %edx, 8(%esp) movl 4(%eax), %eax movl %eax, 12(%esp) fldl 8(%esp) fstpl 24(%esp) fldl 24(%esp) fstpl 40(%esp) fldl 48(%esp) faddl 40(%esp) fstpl 32(%esp) movl (%ebp), %ecx movl 8(%esi), %eax leal 8(%eax,%ecx,8), %ecx fldl 32(%esp) fstpl 24(%esp) movl 24(%esp), %eax movl 28(%esp), %edx movl %eax, (%ecx) movl %edx, 4(%ecx) incl (%ebp) movl $_s3IY_info, %eax L85: jmp *%eax L81: good work, yes? ;-) the C source is also amateur :) IF_(s3IY_entry) { W_ _c3MF; StgDouble _s3IP; StgDouble _s3IQ; StgDouble _s3IS; W_ _s3IW; FB_ _c3MF = (I_)(*Sp) >= (I_)(R1.p[3]); if (_c3MF >= 0x1U) goto _c3MI; _s3IP = PK_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U))); _s3IQ = PK_DBL((P_)(((R1.p[1]) + 0x8U) + ((*Sp) << 0x3U))); _s3IS = _s3IP + _s3IQ; ASSIGN_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)),_s3IS); _s3IW = (*Sp) + 0x1U; *Sp = _s3IW; JMP_((W_)&s3IY_info); _c3MI: R1.p = (P_)(W_)&GHCziBase_Z0T_closure; Sp=Sp+1; JMP_(*Sp); FE_ } the only cause that this code is only 3 times slower is that C version is really limited by memory speed. when tested on 1000-element arrays, it is 20 times slower. i'm not yet tried SSE optimization for gcc ;) -- Best regards, Bulat mailto:bulatz@HotPOP.com