tail-recursing through an associative list

In my first posting, I mentioned that I was going to try to translate some of our code to Haskell and see how it worked. Well, I don't have a stunning demonstration of the power of pure functional programming, but I do have an interesting problem. I chose to port a program that we use in our build system that takes a table of geographic data and groups the rows in the table according to [REDACTED]. The description of what the existing program does takes up only a few paragraphs, but the source code is eight pages of dense C++ that has obviously been optimized up the wazoo (and beyond the point where a mortal like myself can understand what's going on). On one of our servers, it can process 200,000 rows in about three minutes. My Haskell translation is three and a half pages of gorgeous lucid almost-entirely-functional code ... that in its first draft, took about three seconds to process 2,000 rows, eight minutes to process 20,000 rows, and overflowed a 1-MB stack when processing 200,000 rows. Oops. After banging together a version that took input from stdin instead of the database (it was easier to do that then to learn enough about Cabal to get HSQL recompiled with profiling), I profiled the code and observed that the most-called-upon function in the program (200 million entries for those 20,000 rows) was structured like this: type AList = [(Key, [MetaThingies])] myFunction :: AList -> Thingie -> AList myFunction [] x = [(key x, [meta x])] myFunction ((k, v):tail) x | matchKeys k (key x) = case tryJoining v x of Nothing -> (k, v) : (myFunction tail x) Just v' -> (k', v') : tail where v' = bestKey k (key x) | otherwise = (k, v) : (myFunction tail x) I'm wondering if the slowness of the program can be attributed to that case statement in the middle--perhaps unwrapping the Maybe type returned by tryJoining is preventing the function from being properly tail-recursive, or something. (I tried making the list construction strict by replacing "(k, v) : (myFunction tail x)" et al. with "(:) (k, v) $! (myFunction tail x)", and it actually slowed the program down, so either I'm not understanding how to improve things with strictness or laziness isn't the problem here.) Any advice from the gurus?

Seth Gordon
almost-entirely-functional code ... that in its first draft, took about three seconds to process 2,000 rows, eight minutes to process 20,000 rows, and overflowed a 1-MB stack when processing 200,000 rows. Oops.
Which just goes to show that your algorithm is non-linear in the size of the input. I doubt that your posted snippet is the cause of the problem, since it is certainly linear in the AList it is given.
I profiled the code and observed that the most-called-upon function in the program (200 million entries for those 20,000 rows)
By optimisation, you can only make this function a constant factor faster. You need to work out how to call it less often in the first place.
type AList = [(Key, [MetaThingies])]
myFunction :: AList -> Thingie -> AList myFunction [] x = [(key x, [meta x])] myFunction ((k, v):tail) x | matchKeys k (key x) = case tryJoining v x of Nothing -> (k, v) : (myFunction tail x) Just v' -> (k', v') : tail where v' = bestKey k (key x) | otherwise = (k, v) : (myFunction tail x)
myFunction is clearly rather like a map (except that occasionally it stops before traversing the whole list). There is nothing wrong with its recursion pattern or otherwise. Regards, Malcolm

Malcolm Wallace wrote:
Seth Gordon
wrote: almost-entirely-functional code ... that in its first draft, took about three seconds to process 2,000 rows, eight minutes to process 20,000 rows, and overflowed a 1-MB stack when processing 200,000 rows. Oops.
Which just goes to show that your algorithm is non-linear in the size of the input. I doubt that your posted snippet is the cause of the problem, since it is certainly linear in the AList it is given.
The linear time in the length of AList may well be the problem :) You seem to use AList as a key-value mapping (I know the word "associative" only as mathematical property, please correct me). The Key acts as key for the grouping you mentioned and the [MetaThingie] is the actual group of MetaThingie, is that so? That means that with each call to myFunction, the AList roughly grows by one element. For logarithmic access times, you should use a binary search tree like Data.Map or similar. The problem in your case could be that matchKeys is only approximate and your keys cannot be ordered in suitable fasion. Then you need a clever algorithm which somehow exploits extra structure of the keys (perhaps they are intervals, then you can use interval trees etc.). The C++ code is likely to do some magic along these lines. In such case, stunning functional power may come from Finger Trees: A Simple General-purpose Data Structure Ralf Hinze and Ross Paterson. in Journal of Functional Programming16:2 (2006), pages 197-217 http://www.soi.city.ac.uk/~ross/papers/FingerTree.pdf
I profiled the code and observed that the most-called-upon function in the program (200 million entries for those 20,000 rows)
By optimisation, you can only make this function a constant factor faster. You need to work out how to call it less often in the first place.
Almost true. I think that recursive calls are counted as proper call, so that each top level call to myFunction will result a count of calls to myFunction linear in the length of AList. Thus "in first place" alias top level is not enough.
type AList = [(Key, [MetaThingies])]
myFunction :: AList -> Thingie -> AList myFunction [] x = [(key x, [meta x])] myFunction ((k, v):tail) x | matchKeys k (key x) = case tryJoining v x of Nothing -> (k, v) : (myFunction tail x) Just v' -> (k', v') : tail where v' = bestKey k (key x) should be (?) where k' = bestKey k (key x) | otherwise = (k, v) : (myFunction tail x)
Regards, apfelmus

apfelmus@quantentunnel.de wrote:
For logarithmic access times, you should use a binary search tree like Data.Map or similar. The problem in your case could be that matchKeys is only approximate and your keys cannot be ordered in suitable fasion.
That is precisely the problem that I was dealing with. I've been ruminating on some way to use a map to solve the problem, although since this project has been demoted from "cool side project for work" to "cool side project for my Copious Free Time", I don't know when I'll be able to implement it. Thanks for all your suggestions--at least I feel like I am on the right track.

On 10/12/06, Seth Gordon
(it was easier to do that then to learn enough about Cabal to get HSQL recompiled with profiling),
Hmm...That should just require: runghc Setup.hs configure -p Doing that tells cabal to build both the normal library and profiling enabled copy so that when you run: runghc Setup.hs build runghc Setup.hs install Both versions (normal and profiled) should be installed side by side. If that didn't work for you, maybe there is some other problem. I actually compiled HSQL from scratch + HSQL-MySQL just two days ago without a problem using the above commands. Perhaps you need to add a line in the .cabal file to tell ghc which profiling options to use. Something like: ghc-prof-options: -caf-all On a side note, cabal is fairly simple still (from a user point of view), meaning it's easy to learn 80% of the functionality (or maybe I mean learn how to do 80% of what you need). I think what we lack (or what the cabal userguide lacks) is a section on concrete examples taking you from cabalizing a HelloWorld project on up through the work needed to do fancy cross platform stuff. I know there are some examples, but something isn't quite right yet to make it accessible enough for beginners. And I'm not sure what, I just have some fuzzy ideas on what might make it better :) For me once I got comfortable with cabal I found that darcs + cabal makes a mean team. You can quickly pull together different libraries and get some serious hacking done. Much praise to both. I can't wait till cabal-install/hackage is mainstream. HTH, Jason
participants (4)
-
apfelmus@quantentunnel.de
-
Jason Dagit
-
Malcolm Wallace
-
Seth Gordon