
Hi!

I am making a simple raycasting engine. I have a function which takes a point in space and returns the color of the object (if there is any) at that point. Because the whole thing is (or was) really slow even on simple examples, I decided to profile it. It takes around 60 seconds for a 640x480 px image with a depth of field of 400. That is at worst 122,880,000 evaluations (if the scene is rather empty) of a point's coordinates followed by a color check, and 60 seconds looks like a lot to me for that.

Profiling pointed at the main color-checking function, which holds a list of objects (at this time the list is hardcoded). It looks like this:

    world :: SpacePoint -> VoxelColor
    world point = case msum . sequence elements $ point of
        Just v  -> v
        Nothing -> noColor
      where
        elements = [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50]

So there are three spheres in the world, and I check whether the point is in any of them, like this:

    sphere :: SpacePoint -> BasicReal -> VoxelColor -> WorldElement
    -- center of the sphere, its radius, its color
    sphere (x0,y0,z0) r color (x,y,z)
        | x' * x' + y' * y' + z' * z' <= r * r = Just color
        | otherwise                            = Nothing
      where
        x' = x - x0
        y' = y - y0
        z' = z - z0

    redSphere :: SpacePoint -> BasicReal -> WorldElement
    redSphere c r = sphere c r redColor

Profiling told me that the world function takes 38.4 % of all running time, so I decided to play with it. Maybe a more direct approach would be better:

    world :: SpacePoint -> VoxelColor
    world point = findColor [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50]
      where
        findColor []     = noColor
        findColor (f:fs) = case f point of
            Just v  -> v
            Nothing -> findColor fs

Great, it improved: to 40 s. But that was still too much, so I tried this:

    world :: SpacePoint -> VoxelColor
    world point = case redSphere (0,50,0) 50 point of
        Just v  -> v
        Nothing -> case greenSphere (25,-50,0) 50 point of
            Just v  -> v
            Nothing -> case blueSphere (-150,0,150) 50 point of
                Just v  -> v
                Nothing -> noColor

And it took 15 s. The profile also looked the way I would anticipate: calculating point coordinates and checking spheres takes almost all the time.

So, any suggestions on how I could build the list of objects to check at runtime and still get the performance of the third version? Why such a big difference? (I am using GHC 6.8.3 with the -O2 compile switch.)

(The <* operator casts a ray, that is, it multiplies a ray direction vector by a scalar factor.)

Mitar
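For reference, the supporting definitions are not shown in the message; from the usage they appear to be roughly the following (a sketch only; VoxelColor, noColor and the color constants are assumed to be defined elsewhere, and BasicReal = Double is confirmed later in the thread):

    type BasicReal    = Double
    type SpacePoint   = (BasicReal, BasicReal, BasicReal)
    type WorldElement = SpacePoint -> Maybe VoxelColor  -- Just color on a hit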

2008/7/9 Mitar
And it took 15 s. The profile also looked the way I would anticipate: calculating point coordinates and checking spheres takes almost all the time.
So, any suggestions on how I could build the list of objects to check at runtime and still get the performance of the third version? Why such a big difference?
I think the speed difference really comes from using a list to hold the spheres in your first two examples, versus referring to them directly in case statements in your last example. Lists are going to introduce indirections and are therefore slower than referring directly to the values themselves. Maybe you would have better luck using arrays? Template Haskell is also an option: if you want to "hard code" your scene in another module, TH can turn it into that kind of case statement. Of course, as the scenes get more complex, a series of nested cases isn't going to be too efficient. Justin
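A minimal sketch of the Template Haskell approach (the names here are hypothetical; noColor must be in scope where the generator is defined, and the splice has to live in a different module than the generator because of TH's stage restriction):

    {-# LANGUAGE TemplateHaskell #-}
    module SceneTH where

    import Language.Haskell.TH

    -- Given quoted expressions of type WorldElement, generate the unrolled
    --   \point -> case e1 point of { Just v -> v; Nothing -> case e2 point of ... }
    unrollWorld :: [ExpQ] -> ExpQ
    unrollWorld elements = do
        point <- newName "point"
        lamE [varP point] (go (varE point) elements)
      where
        go _ []     = [| noColor |]
        go p (e:es) =
            [| case $e $p of
                 Just v  -> v
                 Nothing -> $(go p es) |]

It would then be used as world = $(unrollWorld [ [| redSphere (0,50,0) 50 |], [| greenSphere (25,-50,0) 50 |], [| blueSphere (-150,0,150) 50 |] ]), keeping the scene a list at the source level while compiling down to the nested-case form.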

Hi!

This is not all. Compare the performance of these two semantically identical functions:

    castRayScene1 :: Ray -> ViewportDotColor
    castRayScene1 (Ray vd o d) = ViewportDotColor vd (castRay' noColor 0)
      where
        castRay' color@(VoxelColor _ _ _ alpha) depth
            | depth > depthOfField = color
            | alpha < 0.001        = castRay' pointColor (depth + distance')
            | alpha > 0.999        = color
            | otherwise            = castRay' newColor (depth + distance')
          where
            (# pointColor, distance #) = worldScene (o <+> (d <* depth))
            distance' = max 1 distance
            newColor  = addColor color pointColor

and:

    castRay :: World -> Ray -> ViewportDotColor
    castRay world (Ray vd o d) = ViewportDotColor vd (castRay' noColor 0)
      where
        castRay' color@(VoxelColor _ _ _ alpha) depth
            | depth > depthOfField = color
            | alpha < 0.001        = castRay' pointColor (depth + distance')
            | alpha > 0.999        = color
            | otherwise            = castRay' newColor (depth + distance')
          where
            (# pointColor, distance #) = world (o <+> (d <* depth))
            distance' = max 1 distance
            newColor  = addColor color pointColor

    castRayScene2 :: Ray -> ViewportDotColor
    castRayScene2 = castRay worldScene

The program which uses castRayScene1 is 1.35 times faster than the program which uses castRayScene2 (37 seconds versus 50 seconds). (Compiled with GHC 6.8.3 and the -O2 switch; the program executes almost nothing but this function, over and over again.)

It is somehow awkward that passing a function as an argument slows the program down so much. Is not Haskell a functional language, and is not exactly this kind of (functional) code reuse one of its main points?

Of course, I could use some preprocessor/template engine to find and replace the castRay-like function with a castRayScene1 before compiling. But does that not somehow kill the idea of a compiler? A smart compiler, which should do such things for you? The same goes for my previous example, where a hardcoded list was not optimized (as if it could change during execution). It looks as if my program were interpreted and not compiled.

Mitar
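One pragma worth trying here (a sketch only; whether GHC 6.8.3's inliner then actually recovers castRayScene1 from castRayScene2 is not guaranteed) is an explicit INLINE on the higher-order function, placed next to its definition:

    {-# INLINE castRay #-}
    castRay :: World -> Ray -> ViewportDotColor
    -- ... definition exactly as above ...

With the pragma, GHC is much more willing to copy castRay's body into castRayScene2, at which point worldScene becomes statically known inside it.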

On 2008 Jul 11, at 19:46, Mitar wrote:
It is somehow awkward that passing a function as an argument slows the program down so much. Is not Haskell a functional language, and is not exactly this kind of (functional) code reuse one of its main points?
That is in fact the case; GHC's versions of various Prelude functions refactor them to avoid passing functional arguments. IIRC the problem is that, while Haskell is indeed functional, passing an overloaded function as an argument forces the runtime to look up the needed type's dictionary on every call, whereas if the function is factored out the lookup can be done only once and (implicitly) let-bound.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
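Concretely, here is a hand-written analogue of the dictionary passing Brandon describes (a sketch; all names are hypothetical, and real GHC dictionaries also carry superclass fields):

    -- A class like Num compiles to a record of its methods:
    data NumDict a = NumDict { dAdd :: a -> a -> a, dMul :: a -> a -> a }

    -- Overloaded code receives the dictionary as a hidden extra argument,
    -- so every arithmetic operation becomes an indirect call:
    sumSquares :: NumDict a -> a -> a -> a
    sumSquares d x y = dAdd d (dMul d x x) (dMul d y y)

    -- A version specialized to Double calls the primitives directly,
    -- and GHC can then also unbox the Doubles:
    sumSquaresD :: Double -> Double -> Double
    sumSquaresD x y = x * x + y * y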

It is somehow awkward that passing a function as an argument slows the program down so much. Is not Haskell a functional language, and is not exactly this kind of (functional) code reuse one of its main points?
I can think of a few reasons why function passing is slow:

* There is an overhead to closure creation (I don't know how big this is in practice, but it can be significant).
* GHC has little information about what function-typed variables mean, because the function they are bound to is not known until runtime. This precludes the use of inlining, rewrite rules etc., which are absolutely key factors in making Haskell fast.

Regarding your first example: currently GHC does not do loop unrolling. It probably should, though, because loop unrolling reduces branches and increases the amount of information about the execution path that is available statically (as in e.g. the case liberation transformation), which probably explains the increased performance you saw by doing it manually in your first email.

Earlier this year I implemented something somewhat similar though: fusable list literals. Take this example from your first email:

    world :: SpacePoint -> VoxelColor
    world point = findColor [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50]
      where
        findColor []     = noColor
        findColor (f:fs) = case f point of
            Just v  -> v
            Nothing -> findColor fs

If findColor had been a function defined in terms of foldr rather than using explicit recursion, then there's a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop-unrolled version:

    world :: SpacePoint -> VoxelColor
    world point = case redSphere (0,50,0) 50 point of
        Just v  -> v
        Nothing -> case greenSphere (25,-50,0) 50 point of
            Just v  -> v
            Nothing -> case blueSphere (-150,0,150) 50 point of
                Just v  -> v
                Nothing -> noColor

Incidentally, if in your most recent email castRayScene2 was your only use of castRay, GHC would have inlined the whole definition into that use site and you would have got castRayScene1 back again.

So, GHC does try its best to make higher-order functions fast :-). But it's quite tricky!

All the best,

Max
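The foldr form Max describes might look like this (a sketch, using the types from the original message):

    world :: SpacePoint -> VoxelColor
    world point = foldr pick noColor elements
      where
        -- Try one element; fall through to the rest of the list on a miss.
        pick f rest = case f point of
            Just v  -> v
            Nothing -> rest
        elements = [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50]

Once foldr is fused with the literal list, this unrolls into exactly the nested-case version above.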

Hi!

(I will reply properly later; I have a project to finish, and GHC is playing with me and does not want to cooperate.)

This project of mine is getting really interesting. It is like playing table tennis with GHC: sometimes it gives me a nice ball, sometimes I have to run around after it. But I just wanted to make a simple raycasting engine, not debug GHC. It is interesting, but right now I do not have time for the cycle of change the code, compile, run on a known input, check whether the elapsed time increased. (I have to do this even when I think I am optimizing things, and even when I think I am merely refactoring, e.g. moving constants to definitions.) And it is a slow process, because every iteration runs for a few minutes.

The next beautiful example in this series is this function for computing a 4D Julia set fractal:

    julia4DFractal :: BasicReal -> World
    julia4DFractal param (x,y,z) = julia4D (Q (x / scale) (y / scale) (z / scale) param) iterations
      where
        c          = Q (-0.08) 0.0 (-0.8) (-0.03)
        alphaBlue  = VoxelColor 0 0 (2 / scale) (2 / scale)
        scale      = fromIntegral sceneHeight / 1.8
        threshold  = 16
        iterations = 100 :: Int

        julia4D _ 0 = (# alphaBlue, 1 #) -- never escaped: point is (probably) in the set
        julia4D q it
            | qMagnitudeSquared q > threshold = (# noColor, 1 #) -- escaped: point is not in the set
            | otherwise                       = julia4D (qSquared q + c) (it - 1)
          where
            distance = scale * qMagnitude qN / (2 * qMagnitude qN') * log (qMagnitude qN)
            (# qN, qN' #) = disIter q (Q 1 0 0 0) iterations
              where
                disIter qn qn' 0 = (# qn, qn' #)
                disIter qn qn' i
                    | qMagnitudeSquared qn > threshold = (# qn, qn' #)
                    | otherwise                        = disIter (qSquared qn + c) (2 * qn * qn') (i - 1)

Please observe that distance is never used, and GHC even warns about this. But the trick is that with this part of the code in place, the program virtually never finishes (I killed it after 15 minutes). If I remove everything on and after the "where distance" line, it finishes in 30 seconds. OK, the problem is with (# qN, qN' #): if this is changed to a normal (qN, qN'), then it works. But to notice this... This is something you have to spend a day on.

Mitar

On Sat, Jul 12, 2008 at 8:57 PM, Mitar
    julia4DFractal :: BasicReal -> World
    julia4DFractal param (x,y,z) = julia4D (Q (x / scale) (y / scale) (z / scale) param) iterations
      where
        c          = Q (-0.08) 0.0 (-0.8) (-0.03)
        alphaBlue  = VoxelColor 0 0 (2 / scale) (2 / scale)
        scale      = fromIntegral sceneHeight / 1.8
        threshold  = 16
        iterations = 100 :: Int

        julia4D _ 0 = (# alphaBlue, 1 #) -- never escaped: point is (probably) in the set
        julia4D q it
            | qMagnitudeSquared q > threshold = (# noColor, 1 #) -- escaped: point is not in the set
            | otherwise                       = julia4D (qSquared q + c) (it - 1)
          where
            distance = scale * qMagnitude qN / (2 * qMagnitude qN') * log (qMagnitude qN)
            (# qN, qN' #) = disIter q (Q 1 0 0 0) iterations
              where
                disIter qn qn' 0 = (# qn, qn' #)
                disIter qn qn' i
                    | qMagnitudeSquared qn > threshold = (# qn, qn' #)
                    | otherwise                        = disIter (qSquared qn + c) (2 * qn * qn') (i - 1)
Please observe that distance is never used, and GHC even warns about this. But the trick is that with this part of the code in place, the program virtually never finishes (I killed it after 15 minutes). If I remove everything on and after the "where distance" line, it finishes in 30 seconds. OK, the problem is with (# qN, qN' #): if this is changed to a normal (qN, qN'), then it works. But to notice this... This is something you have to spend a day on.
My guess is that it was premature optimization that created this bug. Unboxed tuples are not the best answer for every situation. They are evaluated strictly! Which means that

    unboxedBottom x
        | False     = (# 0, 0 #)
        | otherwise = unboxedBottom x

    let (# x, y #) = unboxedBottom 0 in 42

is an infinite loop, not 42 as you would expect. So when you write

    where (# ... #) = something

you are requiring your program to evaluate 'something' regardless of whether it is needed.

Unboxed tuples should be taken in the same vein as explicit strictness annotations: almost never use them, and let GHC do the work for you. If you are in that phase where you are doing performance tweaks and you think GHC's strictness analysis might not be picking up on some strict behavior in your program, add the annotation. If it makes it faster, great; if it doesn't change things, take it out! Best to underconstrain your program.

But these days I try to make my programs fast by making the structure of my program apparent to the compiler, not by forcing it to do things in a certain way. Admittedly, making the structure of a program apparent to the compiler is a rather subtle and brittle process. I'm sure people have at least brainstormed ways to help the compiler more.

Luke
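Applied to julia4D, the fix is the one Mitar found himself: make the binding an ordinary, lazy tuple, so the distance-estimation loop only runs if distance is ever demanded (a sketch of just the relevant where clause):

    where
        distance  = scale * qMagnitude qN / (2 * qMagnitude qN') * log (qMagnitude qN)
        (qN, qN') = disIter q (Q 1 0 0 0) iterations
          where
            disIter qn qn' 0 = (qn, qn')
            disIter qn qn' i
                | qMagnitudeSquared qn > threshold = (qn, qn')
                | otherwise                        = disIter (qSquared qn + c) (2 * qn * qn') (i - 1)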

Hi!
My guess is that it was premature optimization that created this bug.
It is the root of all evil. ;-)
Unboxed tuples are not the best answer for every situation. They are evaluated strictly!
Then I have not understood the last paragraph of this correctly: http://www.haskell.org/ghc/docs/latest/html/users_guide/primitives.html

Oh, no. It is as you say. I also use -funbox-strict-fields, and Q is defined with strict fields. But I tried it without the switch too, and it is the same (it takes forever). But then qN and qN' do not have unboxed types, so should the binding not be lazy?
If you are in that phase where you are doing performance tweaks and you think GHC's strictness analysis might not be picking up on some strict behavior in your program, add the annotation. If it makes it faster, great; if it doesn't change things, take it out! Best to underconstrain your program.
I completely agree. I am also a firm believer in clean and pure code, where I would leave all optimization to the compiler and just write the semantics into the program. But this project has shown me that there is still a long way to go in compiler development before that is possible (and usable): simple refactorings which do not really change the semantics can have a big influence on performance, because the compiler treats them differently (polymorphic types instead of hardcoded types, passing a function as a parameter instead of hardcoding it).

For example, I have now defined my types as:

    type BasicReal = Double

    data Quaternion = Q !BasicReal !BasicReal !BasicReal !BasicReal deriving (Eq,Show)

so that I can easily change the type everywhere. But it would be much nicer to write:

    data Quaternion a = Q !a !a !a !a deriving (Eq,Show)

Only, the performance of the Num instance functions of Quaternion is then quite a bit worse.

Mitar
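For reference, the Num instance in question presumably looks something like this (a sketch; Mitar's actual instance is not shown in the thread, and abs and signum are left undefined here):

    instance Num a => Num (Quaternion a) where
        Q a1 b1 c1 d1 + Q a2 b2 c2 d2 = Q (a1 + a2) (b1 + b2) (c1 + c2) (d1 + d2)
        Q a1 b1 c1 d1 - Q a2 b2 c2 d2 = Q (a1 - a2) (b1 - b2) (c1 - c2) (d1 - d2)
        -- The Hamilton product:
        Q a1 b1 c1 d1 * Q a2 b2 c2 d2 =
            Q (a1*a2 - b1*b2 - c1*c2 - d1*d2)
              (a1*b2 + b1*a2 + c1*d2 - d1*c2)
              (a1*c2 - b1*d2 + c1*a2 + d1*b2)
              (a1*d2 + b1*c2 - c1*b2 + d1*a2)
        negate (Q a b c d) = Q (negate a) (negate b) (negate c) (negate d)
        fromInteger n      = Q (fromInteger n) 0 0 0
        abs    = error "abs: not defined for Quaternion"
        signum = error "signum: not defined for Quaternion"

At a polymorphic type, every one of these + and * calls goes through the Num dictionary for a, which is what makes the instance slower than the hardcoded-Double version.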

2008/7/12 Mitar
So that I can easily change the type everywhere. But it would be much nicer to write:
data Quaternion a = Q !a !a !a !a deriving (Eq,Show)
Only, the performance of the Num instance functions of Quaternion is then quite a bit worse.
You can probably use a specialization pragma to get around that. -- Jedaï
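Concretely, that might look like this (a sketch: qSquared is assumed to have the obvious Num-constrained type; an analogous {-# SPECIALIZE instance Num (Quaternion Double) #-} pragma can also be placed inside the instance declaration itself):

    qSquared :: Num a => Quaternion a -> Quaternion a
    {-# SPECIALIZE qSquared :: Quaternion Double -> Quaternion Double #-}
    qSquared q = q * q

With the pragma, GHC compiles an extra copy of qSquared at Quaternion Double with the dictionary resolved statically, and uses it wherever the types match.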

Hi!
On Fri, Jul 18, 2008 at 3:54 PM, Chaddaï Fouché
So that I can easily change the type everywhere. But it would be much nicer to write:
data Quaternion a = Q !a !a !a !a deriving (Eq,Show)
Only, the performance of the Num instance functions of Quaternion is then quite a bit worse.
You can probably use a specialization pragma to get around that.
But why is this not automatic? If I use Quaternions of only one type in the whole program, why does it not make a specialized version, at least with the -O2 switch? Why exactly are polymorphic functions slower? Is not polymorphism just a question of type checking (and of writing portable/reusable code)? Later in the compilation process we do know exactly at which type the called function is used, so we could compile the function as if it had been written only for that type. That is what specialization does, as I understand it; I thought this was always done. Mitar

Mitar wrote:
On Fri, Jul 18, 2008 at 3:54 PM, Chaddaï Fouché
wrote: So that I can easily change the type everywhere. But it would be much nicer to write:
data Quaternion a = Q !a !a !a !a deriving (Eq,Show)
Only, the performance of the Num instance functions of Quaternion is then quite a bit worse.
You can probably use a specialization pragma to get around that.
But why is this not automatic? If I use Quaternions of only one type in the whole program, why does it not make a specialized version, at least with the -O2 switch?
You could try jhc: it does whole-program optimization, whereas GHC compiles each module separately. Cheers Ben

ben.franksen:
Mitar wrote:
On Fri, Jul 18, 2008 at 3:54 PM, Chaddaï Fouché
wrote: So that I can easily change the type everywhere. But it would be much nicer to write:
data Quaternion a = Q !a !a !a !a deriving (Eq,Show)
Only, the performance of the Num instance functions of Quaternion is then quite a bit worse.
You can probably use a specialization pragma to get around that.
But why is this not automatic? If I use Quaternions of only one type in the whole program, why does it not make a specialized version, at least with the -O2 switch?
You could try jhc: it does whole program optimization. Ghc compiles each module separately.
No need to switch compilers. GHC is able to do a pretty good job. Consider:

    data Q a = Q !a !a !a !a deriving (Eq,Show)   -- yeah, polymorphic

    go :: Num a => Q a -> Q a
    go (Q 0 0 0 0) = Q 1 2 3 4
    go (Q a b c d) = go $! Q (a * a) (b * b) (c * c) (d * d)

    -- ah, but we fix it.
    type QI = Q Int

    -- and try that:
    main = print (go (Q 2 3 7 13 :: QI))

GHC specialises and gives us:

    $wgo :: Int# -> Int# -> Int# -> Int# -> Q Int

So just use ghc-core to check what you're getting.

-- Don
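(If the ghc-core wrapper is not installed, running plain ghc with -O2 -ddump-simpl prints the same simplified Core output that ghc-core pretty-prints.)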

Hi!
On Sat, Jul 12, 2008 at 3:33 AM, Max Bolingbroke
If findColor had been a function defined in terms of foldr rather than using explicit recursion, then there's a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop-unrolled version:
My first version was with msum. Is this also covered by this fusion? (And it is interesting that my own recursive version is faster than the version with msum. Why?)
Incidentally, if in your most recent email castRayScene2 was your only use of castRay, GHC would have inlined the whole definition into that use site and you would have got castRayScene1 back again.
It is a little more tricky: I choose, in the IO monad, which scene to render (selected by a program argument), so at compile time it is not yet known which one will be used. But there is a finite number of possibilities (in my case two), so why not compile both inlined versions and choose one at run time? Mitar

2008/7/18 Mitar
On Sat, Jul 12, 2008 at 3:33 AM, Max Bolingbroke
wrote: If findColor had been a function defined in terms of foldr rather than using explicit recursion, then there's a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop-unrolled version:
My first version was with msum. Is this also covered by this fusion?
Note that, as I said, the fusion only applies to 6.9 onwards. However, assuming that you were using msum at the list monad, then, since msum there is just concat and concat is defined in terms of foldr, there is a chance it could happen. The best guide to this kind of thing is not asking me but looking at the Core GHC outputs for your particular program.
(And it is interesting that my own recursive version is faster than the version with msum. Why?)
I don't know why you find this surprising :-). Your own version is specialized exactly for the situation you wish to use it for. msum is a generic combinator, which naturally makes it less amenable to optimization, because there is less information available about its usage pattern. Note that GHC will of course try its best to specialize the msum function for any particular scenario it's used in, through inlining etc., but there are no guarantees.
It is a little more tricky: I choose, in the IO monad, which scene to render (selected by a program argument), so at compile time it is not yet known which one will be used. But there is a finite number of possibilities (in my case two), so why not compile both inlined versions and choose one at run time?
If there were only one static occurrence of each identifier in your module then it would do that, because there are no code-size implications for inlining a function into its single use site. Inlining is not a cure-all for performance, though: if you inline too much you increase code size, which increases the amount of main memory you're reading and reduces instruction cache hits, not to mention filling up your disk with multi-MB binaries. Max

Hi!

I had to change the code somewhat. Now I have a function like:

    worldScene point = case redSphere (0,50,0) 50 point of
        v@(Right _) -> v
        Left d1     -> case greenSphere (25,-250,0) 50 point of
            v@(Right _) -> v
            Left d2     -> Left $ d1 `min` d2

(Of course there could be more objects.)

Any suggestions on how I could make this less hardcoded? Something which would take a list of objects (object functions) and return a Right if any object function returns Right, or else the minimum value of all the Lefts. But with similar performance? If not on my GHC version (6.8.3), then on something newer (which uses fusion or the like). Is there some standard function for this, or should I write my own recursive function to run over a list of object functions? I am afraid that this will be hard for the compiler to optimize.

(It is important to notice that the order in which the object functions are executed does not matter, so this could be a good candidate for parallelism.)

Mitar
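A hand-rolled recursive version over a list of object functions might look like this (a sketch; the element type SpacePoint -> Either BasicReal VoxelColor is assumed from the code above, with the Left carrying a distance):

    worldScene :: SpacePoint -> Either BasicReal VoxelColor
    worldScene point = go [redSphere (0,50,0) 50, greenSphere (25,-250,0) 50]
      where
        go []     = Left 0   -- not reached for a non-empty scene
        go [f]    = f point
        go (f:fs) = case f point of
            v@(Right _) -> v
            Left d      -> case go fs of
                v@(Right _) -> v
                Left d'     -> Left (d `min` d')

As with findColor earlier in the thread, whether this compiles down to the hand-unrolled nested cases depends on GHC inlining go and the literal list.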

I had similar experiences as you when attempting to write "high
performance Haskell"; the language makes you want to use high-level
abstracted functions but the optimizer (while amazing, to be honest)
seems to miss a few cases that it seems like it should hit.
The problem seems to be that the compiler is extremely good at
optimizing systems-level code, but that any "control-structure"
function needs to be extremely inlined to be successful. You might
try re-writing "sequence" or "foldM" a few different ways using the
same "test" function and see what you can get.
One thing that I notice about this code is that if you switch Right
and Left you will get the default behavior for the Either monad:
    worldSceneSwitch point = case redSphere (0,50,0) 50 point of
        v@(Left _) -> v
        Right d1   -> case greenSphere (25,-250,0) 50 point of
            v@(Left _) -> v
            Right d2   -> Right $ d1 `min` d2
is the same as
    worldSceneSwitch point = do
        d1 <- redSphere (0, 50, 0) 50 point
        d2 <- greenSphere (25, -250, 0) 50 point
        Right (d1 `min` d2)
Here is the same concept using foldM:
    -- needs Control.Monad (foldM, liftM); the fold runs each monadic
    -- element and keeps the minimum of the results
    minimumM (x : xs) = do
        v0 <- x
        foldM (\v my -> liftM (min v) my) v0 xs
    worldSceneSwitch point = minimumM
        [ redSphere (0, 50, 0) 50 point
        , greenSphere (25, -250, 0) 50 point
        ]
However, the performance here will be terrible if minimumM and foldM
do not get sufficiently inlined; you do not want to be allocating list
thunks just to execute them shortly thereafter.
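A sketch of nudging GHC in that direction (INLINE is the standard pragma; whether it is sufficient here is untested):

    import Control.Monad (foldM, liftM)

    {-# INLINE minimumM #-}
    minimumM :: (Monad m, Ord a) => [m a] -> m a
    minimumM []       = error "minimumM: empty list"
    minimumM (x : xs) = do
        v0 <- x
        foldM (\v my -> liftM (min v) my) v0 xs

With the pragma, GHC will unfold minimumM (and, with luck, foldM over the literal list) at each call site, so no list of thunks survives to runtime.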
-- ryan
On Sat, Jul 19, 2008 at 6:48 AM, Mitar
Hi!
I had to change code somewhat. Now I have a function like:
    worldScene point = case redSphere (0,50,0) 50 point of
        v@(Right _) -> v
        Left d1     -> case greenSphere (25,-250,0) 50 point of
            v@(Right _) -> v
            Left d2     -> Left $ d1 `min` d2
(Of course there could be more objects.)
Any suggestions on how I could make this less hardcoded? Something which would take a list of objects (object functions) and return a Right if any object function returns Right, or else the minimum value of all the Lefts. But with similar performance? If not on my GHC version (6.8.3), then on something newer (which uses fusion or the like). Is there some standard function for this, or should I write my own recursive function to run over a list of object functions? I am afraid that this will be hard for the compiler to optimize.
(It is important to notice that the order in which the object functions are executed does not matter, so this could be a good candidate for parallelism.)
Mitar

I had similar experiences as you when attempting to write "high performance Haskell"; the language makes you want to use high-level abstracted functions but the optimizer (while amazing, to be honest) seems to miss a few cases that it seems like it should hit.

Ryan, if you find any of these, do please submit them to GHC's Trac bug-tracker. There's a special category for performance bugs. Small programs of the form "GHC should hit this, but doesn't" are incredibly useful.

Thanks

Simon

Done. http://hackage.haskell.org/trac/ghc/ticket/2465
-- ryan
On 7/23/08, Simon Peyton-Jones
I had similar experiences as you when attempting to write "high performance Haskell"; the language makes you want to use high-level abstracted functions but the optimizer (while amazing, to be honest) seems to miss a few cases that it seems like it should hit.
Ryan, if you find any of these, do please submit them to GHC's Trac bug-tracker. There's a special category for performance bugs. Small programs of the form "GHC should hit this, but doesn't" are incredibly useful.
Thanks
Simon
participants (10): Ben Franksen, Brandon S. Allbery KF8NH, Chaddaï Fouché, Don Stewart, Justin Bailey, Luke Palmer, Max Bolingbroke, Mitar, Ryan Ingram, Simon Peyton-Jones