
I've done some stuff with maybe 50k rows at a time. A few bits and pieces:
1: I've used HSQL (http://sourceforge.net/project/showfiles.php?group_id=65248) to talk to ODBC databases. Works fine, but possibly a bit slowly. I'm not sure where the delay is: it might just be the network I was running it over. One gotcha: the field function takes a field name, but it's not random access. Access the fields in query order or it crashes.
2: For large data sets laziness is your friend. When reading files, "getContents" presents an entire file as a list, but it's really evaluated lazily. This is implemented using unsafeInterleaveIO. I've never used this, but in theory you should be able to set up a query that returns the entire database as a list and then step through it using lazy evaluation in the same way.
3: You don't say whether these algorithms are just row-by-row algorithms or whether there is something more sophisticated going on. Either way, try to make things into lists and then apply map, fold and filter operations. It's much more declarative and high level when you do it that way.
Let us know how you get on.
Paul.
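To make point 2 concrete, here is a minimal sketch of a lazy pipeline over standard input (which is what getContents reads). The function name sumAbove and the one-integer-per-line input format are assumptions made for this illustration; they are not from the thread.

```haskell
import Data.List (foldl')

-- | Sum every value above a threshold, reading one value per input line.
-- Assumes each line holds a single integer (an assumption of this sketch).
sumAbove :: Int -> String -> Int
sumAbove threshold =
  foldl' (+) 0 . filter (> threshold) . map read . lines

main :: IO ()
main = do
  contents <- getContents        -- lazy: nothing has been read yet
  print (sumAbove 100 contents)  -- input is consumed as the fold demands it
```

Because the input string is produced on demand and the fold is strict, lines that have already been summed can be garbage collected, so the whole input should never need to sit in memory at once.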

Paul Johnson wrote:
I've done some stuff with maybe 50k rows at a time. A few bits and pieces:
1: I've used HSQL (http://sourceforge.net/project/showfiles.php?group_id=65248) to talk to ODBC databases. Works fine, but possibly a bit slowly. I'm not sure where the delay is: it might just be the network I was running it over. One gotcha: the field function takes a field name, but it's not random access. Access the fields in query order or it crashes.
Thanks; that's certainly the sort of thing I like knowing in advance.
2: For large data sets laziness is your friend. When reading files, "getContents" presents an entire file as a list, but it's really evaluated lazily. This is implemented using unsafeInterleaveIO. I've never used this, but in theory you should be able to set up a query that returns the entire database as a list and then step through it using lazy evaluation in the same way.
I assume that the collectRows function in HSQL can produce this kind of a lazy list...right?
3: You don't say whether these algorithms are just row-by-row algorithms or whether there is something more sophisticated going on. Either way, try to make things into lists and then apply map, fold and filter operations. It's much more declarative and high level when you do it that way.
I'm going to need to do some mapping, folding, partitioning...
Let us know how you get on.
I certainly will.
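As a rough illustration of the mapping, folding and partitioning mentioned above, the sketch below works on rows that are already in a list. The Row type, its fields and the helper names are hypothetical; a real query would yield whatever shape the schema dictates.

```haskell
import Data.List (foldl', partition)

-- | Hypothetical row shape, invented for this example.
data Row = Row { rowId :: Int, region :: String, amount :: Double }
  deriving (Show, Eq)

-- | Split rows into those at or above a cutoff and those below it.
splitByAmount :: Double -> [Row] -> ([Row], [Row])
splitByAmount cutoff = partition ((>= cutoff) . amount)

-- | Total amount across rows, folded strictly to avoid thunk build-up.
totalAmount :: [Row] -> Double
totalAmount = foldl' (+) 0 . map amount

main :: IO ()
main = do
  let rows = [ Row 1 "east" 120, Row 2 "west" 40, Row 3 "east" 75 ]
      (large, small) = splitByAmount 100 rows
  print (map rowId large)
  print (map rowId small)
  print (totalAmount (filter ((== "east") . region) rows))
```

Each step is an ordinary list function, so the same code works whether the rows came from a database, a file or a test fixture.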

On 10/1/06, Seth Gordon wrote:
Paul Johnson wrote:
I've done some stuff with maybe 50k rows at a time. A few bits and pieces:
1: I've used HSQL (http://sourceforge.net/project/showfiles.php?group_id=65248) to talk to ODBC databases. Works fine, but possibly a bit slowly. I'm not sure where the delay is: it might just be the network I was running it over. One gotcha: the field function takes a field name, but it's not random access. Access the fields in query order or it crashes.
Thanks; that's certainly the sort of thing I like knowing in advance.
This behaviour depends on the underlying database. I remember that MSSQL suffers from this disease. In addition, with MSSQL you can't have more than one open dataset on a given connection. MySQL and PostgreSQL don't have this problem. HSQL can't hide all the differences between the possible backends.
2: For large data sets laziness is your friend. When reading files, "getContents" presents an entire file as a list, but it's really evaluated lazily. This is implemented using unsafeInterleaveIO. I've never used this, but in theory you should be able to set up a query that returns the entire database as a list and then step through it using lazy evaluation in the same way.
I assume that the collectRows function in HSQL can produce this kind of a lazy list...right?
No. collectRows collects all records eagerly. The problem with lazy fetching is that you can close the database connection before your lazy data sets have been fetched. It gets worse if you can't have multiple open data sets. If you still want lazy fetching, you can write a custom function like collectRows but using unsafeInterleaveIO.
Cheers,
Krasimir
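A minimal sketch of what such a custom function might look like, written against a stand-in cursor rather than HSQL itself. The names lazyRows and mockCursor are hypothetical, and the "fetch next row" action of type IO (Maybe a) is an assumption; a real version would wrap whatever fetch call the database library provides.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')
import System.IO.Unsafe (unsafeInterleaveIO)

-- | Build a lazy list from an action that yields the next row, or Nothing
-- once the cursor is exhausted. A row is fetched only when the list is
-- forced that far.
lazyRows :: IO (Maybe a) -> IO [a]
lazyRows fetchNext = unsafeInterleaveIO $ do
  next <- fetchNext
  case next of
    Nothing  -> return []
    Just row -> do
      rest <- lazyRows fetchNext   -- deferred again, so no fetch happens here
      return (row : rest)

-- | Hypothetical stand-in for a database cursor: hands out the elements of
-- a list one at a time and counts how many fetches were made.
mockCursor :: [a] -> IO (IO (Maybe a), IORef Int)
mockCursor rows = do
  remaining <- newIORef rows
  fetches   <- newIORef 0
  let fetchNext = do
        modifyIORef' fetches (+ 1)
        rs <- readIORef remaining
        case rs of
          []         -> return Nothing
          (r : rest) -> writeIORef remaining rest >> return (Just r)
  return (fetchNext, fetches)

main :: IO ()
main = do
  (fetchNext, fetches) <- mockCursor [1 .. 50000 :: Int]
  rows <- lazyRows fetchNext
  print (take 3 rows)              -- forces only the first three rows
  n <- readIORef fetches
  putStrLn ("fetches so far: " ++ show n)
```

The caveat from the message applies directly: if the connection or statement behind fetchNext is closed before the list has been fully consumed, forcing the rest of the list will run the fetch against a closed resource.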
participants (3): Krasimir Angelov, Paul Johnson, Seth Gordon