
I've finally gotten enough round tuits to learn Haskell, and now that I've done some of the exercises from _The Haskell School of Expression_ and I finally (think I) understand what a monad is, the language is making a lot more sense to me (although my code is not always making so much sense to the compiler :-).

My employer (MetaCarta) makes a search engine that can recognize geographic data. My group within MetaCarta is responsible for building the "Geographic Data Module" within our software. To do this, we slurp a heap of geographic and linguistic data from a variety of sources, normalize it, and then use some algorithms (that I'm not allowed to describe) to generate the module.

This seems like the sort of task that cries out for a functional-programming approach, and that's what we use, sorta: a lot of the code that I'm responsible for is SQL, with chains of "CREATE TEMP TABLE X AS [insert very complicated query here]", some C++ for the parts that would be very time-consuming or impossible to implement in SQL, and shell scripts to tie everything together.

I told my tech lead that I want to try porting some of this code to Haskell in the hope that it would run faster and/or be easier to read. He said I should spend two work days on the project and then be prepared to convince my co-workers that further research in this vein is (or is not) worth doing.

So before I embark on day 1 of the project, I thought I should check and see if anyone on this list has used Haskell to munge a ten-million-row database table, and if there are any particular gotchas I should watch out for.

adTHANKSvance....

Hello,
> So before I embark on day 1 of the project, I thought I should check and see if anyone on this list has used Haskell to munge a ten-million-row database table, and if there are any particular gotchas I should watch out for.
One immediate thing to be careful about is how you do IO. Haskell is not very good, in my experience, at reading files fast. You'll probably want to skip the standard Haskell IO functions and use the lazy bytestring library (http://www.cse.unsw.edu.au/~dons/fps.html).

Another thing to be careful about is laziness. I suspect it will be very easy to write code that does what you want but overflows your heap space due to delaying the computation on each row until after the entire file is read and the result of the complete computation is needed. More information on this is available at: http://haskell.org/haskellwiki/Performance.

Good luck,
Jeff
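To make the two points above concrete, here is a minimal sketch, not anything from the actual MetaCarta code. It assumes the rows arrive as newline-separated text on stdin; the `Stats` type and the `step` and `summarize` functions are hypothetical names invented for illustration. It reads the input as a lazy ByteString and folds over the rows with `foldl'` into a record with strict fields, so each row is folded into the running result as it is read rather than piling up as unevaluated thunks until the end of the file.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.List (foldl')

-- Hypothetical running summary: number of rows and total bytes seen.
-- The strict fields (!) mean that forcing a Stats value also forces
-- both counters, so no chain of (+) thunks can build up inside it.
data Stats = Stats !Int !Int
  deriving (Eq, Show)

-- Fold one row into the running summary.
step :: Stats -> BL.ByteString -> Stats
step (Stats rows bytes) row =
  Stats (rows + 1) (bytes + fromIntegral (BL.length row))

-- foldl' forces the accumulator at every step; BL.lines splits the
-- lazy input on newlines without needing the whole input in memory.
summarize :: BL.ByteString -> Stats
summarize = foldl' step (Stats 0 0) . BL.lines

main :: IO ()
main = do
  contents <- BL.getContents  -- read lazily, chunk by chunk
  print (summarize contents)
```

Replacing `foldl'` with the lazy `foldl`, or dropping the strictness annotations on `Stats`, is the kind of change that can bring back the heap growth described above, since the per-row work would then be deferred until the final result is demanded.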

jeff p wrote:
> Hello,
> > So before I embark on day 1 of the project, I thought I should check and see if anyone on this list has used Haskell to munge a ten-million-row database table, and if there are any particular gotchas I should watch out for.
> One immediate thing to be careful about is how you do IO. Haskell is not very good, in my experience, at reading files fast. You'll probably want to skip the standard Haskell IO functions and use the lazy bytestring library (http://www.cse.unsw.edu.au/~dons/fps.html).
I'm planning to use HSQL, since it's in Debian stable and the API resembles what I'm already familiar with. Database access is slower than file access (which is one reason I want to move as much logic as I can out of SQL), so if the speed of getting rows out of the database turns out to be the bottleneck in my code, I'll either be happy that all the other code is so efficient or peeved that HSQL is so inefficient.

On 10/1/06, Seth Gordon wrote:
> I'm planning to use HSQL, since it's in Debian stable and the API resembles what I'm already familiar with. Database access is slower than file access (which is one reason I want to move as much logic as I can out of SQL), so if the speed of getting rows out of the database turns out to be the bottleneck in my code, I'll either be happy that all the other code is so efficient or peeved that HSQL is so inefficient.
Hi Seth,

HSQL is just a thin wrapper around the underlying database engine. The performance in this case depends mainly on the database engine that you are using.

Cheers,
Krasimir
participants (3): jeff p, Krasimir Angelov, Seth Gordon