
Hi folks,

I'm competing in a contest at work, and we're allowed to use whatever language we want. I decided this was my chance to prove to people that Haskell was up to the challenge. Unfortunately, I ran into performance problems. Since the contest ends this Friday, I've decided to switch to C++ (gasp!). But if any of you have advice on how to speed up this code, it could help me advocate Haskell in the future.

It's supposed to match movie titles from an imported database to a reference database. The version I've sent doesn't do anything very smart - it's just doing literal title matches. The first argument to the program is the filename of the table to be imported, and the second is the filename of the reference table. The first line of each table is a pipe-separated list of field names; the rest of the lines are records, each a pipe-separated list of values.

The import files each have 3,000 records, and the reference table has 137,986 records.

Building the hash tables out of the files is quick - it just takes a few seconds. But matching title_id in one table to title_id in the other, in a nested loop over both tables, takes way too long. It's matching two import titles (against each of the reference titles) per second. It needs to do at least 20 per second to qualify for the contest, and it's not doing anything fancy yet.

I tried various "improvements" to speed it up. One was to use ByteString directly, eliminating the AbsString class. That didn't make a difference. Another was to use arrays instead of lists to store each record, and to precompute the indices of the fields within those records. I also iterated over a list of keys instead of the list of Maps, and only converted each record to a Map one at a time, hoping they would be disposed of sooner. Instead of speeding up the program, this slowed it down by a factor of 20!

I've profiled it, but I can't make much out of the results. It seemed to be spending 25% of its time doing scoring, and I thought the problem must be due to laziness, but I'm not sure.

So if anyone has any ideas how to speed this up by a factor of at least 10, it would be really appreciated! Even the Ruby solutions are doing that, which is embarrassing.

Thanks,
Lyle
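[For concreteness, here is a minimal sketch of reading the file format described above - not the actual contest code, just plain String I/O with hypothetical names:]

-- Split a line on '|', preserving empty fields.
splitPipe :: String -> [String]
splitPipe s = case break (== '|') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitPipe rest

-- A table is its header (the field names) plus one field list per record.
parseTable :: String -> ([String], [[String]])
parseTable contents = case map splitPipe (lines contents) of
  header : records -> (header, records)
  []               -> ([], [])

main :: IO ()
main = do
  text <- readFile "import.txt"   -- hypothetical filename
  let (fields, records) = parseTable text
  print fields
  print (length records)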

On 9/26/06, Lyle Kopnicky wrote:
[snip]
Do you have some test input online?

-- Cheers, Lemmih

Lemmih wrote:
Do you have some test input online?
I've attached some (very short) input files. Sorry I can't provide more - they're proprietary databases. I know that means you can't actually test the performance, but can only give me advice. At least you can run the program on them, and there is one match.

- Lyle

title_id|ref_title_id|importance|title|studio|run_time|rating|theatrical_release|home_video_release|box_office_gross|synopsis|actors|director
26246|157067|2|Sniper|TRI||R|29-JAN-93|04-AUG-93|18994653|A rebel leader and a wealthy drug lord are pursued in Panama by a marine sergeant and Oympic marksman.|Dale Dye;Aden Young;Tom Berenger;J.T. Walsh;Richard Lineback;Billy Zane|Luis Llosa
68308|174883|2|Northern Passage||97|||27-FEB-96|0|In the wilds of North America in the 1800s, Paul a young zoologist has pledged to keep as eye on his longtime friend's beautiful daughter, Nepeese. Nepeese and Paul fall in love, but their relationship is interrupted when an abusive fur trader kidnaps Nepeese. Starring Jeff Fahey.|Jeff Fahey;Neve Campbell|
68591|196655|2|Old Man And The Sea (1990)||97|||01-JUL-96||An old man heads out to sea, hooks a large fish, and battles a shark who has taken his catch.|Anthony Quinn;Alexis Cruz|
51506|190181|2|Shoot Out||95|||27-APR-99|||Gregory Peck;Robert F. Lyons|

title_id|importance|title|studio|runtime|rating|theatrical_release|home_video_release|theatrical_box_office_gross|actors|director
214063|2|Elvis: His Best Friend Remembers|UNI||NR||30-JUL-02||Elvis Presley|
133868|2|Mighty Hercules, The - Speed Racer - V. 5||30|NR|||||
174883|2|Northern Passage|VMM|97|PG-13||18-JAN-00||Jeff Fahey;Neve Campbell;Jacques Weber;Lorne Brass;Genevieve Rochette|Arnaud Selignac
121430|2|Appointment With Death|WAR|||15-APR-88||960040|Peter Ustinov;Lauren Bacall;Carrie Fisher;Piper Laurie;John Gielgud;Hayley Mills;Jenny Seagrove;David Soul|Michael Winner

On Tuesday 26 September 2006 16:44, Lyle Kopnicky wrote:
[snip]
Building the hash tables out of the files is quick - it just takes a few seconds. But doing the matching of title_id in one table to title_id in the other, in a nested loop between both tables, takes way too long. It's matching two import titles (against each of the reference titles) per second. It needs to do at least 20 per second to qualify for the contest, and it's not doing anything fancy yet.
Humm... well, a double nested loop seems like the wrong approach. Also, if you are using GHC, its hashtable implementation has fairly well-known performance problems.

If all you care about is exact matching, then the operation is essentially a finite map intersection (if I've understood what you are trying to do). This is just a guess, but I suspect you will probably get much better performance (and better-looking code!) by just using Data.Map.intersection:

http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3...

Alternately, there is the ternary trie implementation from Edison (http://www.eecs.tufts.edu/~rdocki01) that may also work for you.

If you need to do prefix matching, then a trie is the way to go. You can probably code up a nice prefix-intersection operation using tries that should go pretty fast.

If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
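[A minimal sketch of the intersection idea, assuming the tables are keyed by the literal title string - the field choice and type names here are hypothetical:]

import qualified Data.Map as Map

-- title -> title_id, one map per table, built once from the parsed records.
type TitleMap = Map.Map String String

-- Exact matches are exactly the keys present in both maps.
exactMatches :: TitleMap -> TitleMap -> Map.Map String (String, String)
exactMatches imported reference = Map.intersectionWith (,) imported reference

[Building each map is O(n log n), and the intersection then replaces the 3,000 x 137,986 nested loop entirely.]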
-- Rob Dockins

Talk softly and drive a Sherman tank.
Laugh hard, it's a long way to the bank.
-- TMBG

Robert Dockins wrote:
Humm... well, a double nested loop seems like the wrong approach.

It may be. I had hoped it would be fast enough that way.

Also, if you are using GHC, its hashtable implementation has fairly well-known performance problems. If all you care about is exact matching, then the operation is essentially a finite map intersection (if I've understood what you are trying to do).
No, I don't just care about exact matching. I care about very fuzzy matching. It's just that the code I've implemented so far only does exact matching on the title strings. It's a first step.
This is just a guess, but I suspect you will probably get much better performance (and better-looking code!) by just using Data.Map.intersection
http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3...
Thanks, but that won't help with the fuzzy matching.
Alternately, there is the ternary trie implementation from Edison (http://www.eecs.tufts.edu/~rdocki01) that may also work for you.
That may be useful.
If you need to do prefix matching, then a trie is the way to go. You can probably code up a nice prefix-intersection operation using tries that should go pretty fast.
If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.

Thanks,
Lyle
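[For reference, a sketch of the standard dynamic-programming edit distance (Levenshtein) in Haskell, keeping only one row of the DP table in memory. A production version would use unboxed arrays and ByteString:]

editDistance :: String -> String -> Int
editDistance s t = last (foldl nextRow [0 .. length t] s)
  where
    -- row !! j is the distance from the prefix of s consumed so far
    -- to the first j characters of t; rows are never empty.
    nextRow row@(d : ds) c = scanl step (d + 1) (zip3 t row ds)
      where
        step left (tc, diag, up) =
          minimum [ left + 1                           -- insertion
                  , up + 1                             -- deletion
                  , diag + if c == tc then 0 else 1 ]  -- substitution

[For example, editDistance "kitten" "sitting" is 3.]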

Lyle Kopnicky writes:
If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).

For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).

-k

-- If I haven't seen further, it is by standing in the footprints of giants
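[A sketch of such a word index using Data.Map, with hypothetical names. Looking up each word of a query title yields the candidate reference titles worth scoring exhaustively:]

import qualified Data.Map as Map
import Data.Char (isAlphaNum, toLower)

type WordIndex = Map.Map String [String]   -- word -> titles containing it

-- Lower-case a title and break it into words.
titleWords :: String -> [String]
titleWords = words . map (\c -> if isAlphaNum c then toLower c else ' ')

-- Build the index once over all reference titles.
buildIndex :: [String] -> WordIndex
buildIndex titles =
  Map.fromListWith (++) [ (w, [t]) | t <- titles, w <- titleWords t ]

-- Titles sharing at least one word with the query (may contain duplicates).
candidates :: WordIndex -> String -> [String]
candidates ix query =
  concat [ Map.findWithDefault [] w ix | w <- titleWords query ]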

Ketil Malde wrote:
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).
For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).
Gotcha. That's exactly the approach I've switched to. It is possible to miss titles if words are misspelled, but it's unlikely that all words in a title will be misspelled, so you can at least narrow your search to titles that have at least one matching (non-trivial) word.

- Lyle

Ketil Malde wrote:
Lyle Kopnicky writes:

If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).
For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).
-k
If you really need to do edit distance, perhaps something like "Low Distortion Embeddings for Edit Distance" would help:

http://www.cs.ucla.edu/~rafail/PUBLIC/68.pdf

This is just one of the first results from a search; there are probably plenty of things out there. Make up for any constant factors from your language choice with advanced algorithms! (And any constant is getting pretty small these days.)

Brandon

Lyle Kopnicky wrote: [snip]
data TextTable s = TextTable
  { tableFields   :: ![s]
  , keyFieldIndex :: !Int
  , tableRecords  :: !(HashTable s (Array Int s))
  }
[snip]
listRecords :: AbsString s => TextTable s -> IO [TextRecord s]
listRecords (TextTable fields _ records) = do
  keyRecs <- HT.toList records
  return $ map (fromList . zip fields . elems . snd) keyRecs
Doing fromList again and again can't be good. Why don't you make tableFields a map that maps names to array indices? Then you can just pass the bare arrays along, and the later lookups will be cheaper, too.

Now due to laziness this will probably be evaluated in matchScore, because before that the resulting Map isn't used - which is exactly where you said a lot (most?) of the time is spent.

Another thing is that you should compile your code with -O, but I guess you are already doing that.

Hope this helps,
Bertram
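[A sketch of that suggestion, with hypothetical types - the name-to-index map is built once per table, and the records stay bare arrays:]

import qualified Data.Map as Map
import Data.Array (Array, (!))

-- Built once from the header line.
fieldIndices :: Ord s => [s] -> Map.Map s Int
fieldIndices fields = Map.fromList (zip fields [0 ..])

-- One shared Map lookup plus an O(1) array index per field access.
fieldValue :: Ord s => Map.Map s Int -> Array Int s -> s -> Maybe s
fieldValue ix record name = fmap (record !) (Map.lookup name ix)

[Better still, the index of a hot field like the title can be looked up once outside the matching loop: given i, every access is just record ! i.]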

Bertram Felgenhauer wrote:
Lyle Kopnicky wrote:
[snip]
listRecords :: AbsString s => TextTable s -> IO [TextRecord s]
listRecords (TextTable fields _ records) = do
  keyRecs <- HT.toList records
  return $ map (fromList . zip fields . elems . snd) keyRecs
Doing fromList again and again can't be good. Why don't you make tableFields a map that maps names to array indices? Then you can just pass the bare arrays along, and the later lookups will be cheaper, too.
That might make a difference. It does spoil the interface a bit, since now the caller has to look up a field name to get an index, then use that to look up a value, instead of just using the field name to get the value.
Now due to laziness this will probably be evaluated in matchScore, because before that the resulting Map isn't used - which is exactly where you said a lot (most?) of the time is spent.
Yes, likely. I had to run on a small pair of files in order to get the profiling to work. So, probably more time was spent in matchScore than it admitted. (The overhead of the initial read would decrease as the table size increases.)
Another thing is that you should compile your code with -O, but I guess you are already doing that.
Yep. Thanks.

- Lyle
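[If laziness is indeed the culprit, one hedged experiment (a sketch with hypothetical names) is to force each record's Map as it is produced, so the construction cost is paid during table loading rather than inside matchScore. Forcing a Data.Map to WHNF builds its whole spine, though the values themselves stay lazy:]

import qualified Data.Map as Map

-- Force each record's Map as its list cell is produced.
strictRecords :: Ord k => [[(k, v)]] -> [Map.Map k v]
strictRecords = foldr force []
  where
    force kvs acc = let m = Map.fromList kvs
                    in m `seq` (m : acc)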

Hello Lyle, Wednesday, September 27, 2006, 12:44:05 AM, you wrote:
It's supposed to match movie titles from an imported database to a reference database.
The import files each have 3,000 records, and the reference table has 137,986 records.
Building the hash tables out of the files is quick - it just takes a few seconds. But doing the matching of title_id in one table to title_id in the other, in a nested loop between both tables, takes way too long.
Something not unexpected :) I solve a similar problem in my program - when it updates an archive, it has to join two filelists, avoiding all the duplicates. And it processes 250k lists in a few seconds ;)

Basically it uses two techniques. First, use a Hash to filter out repetitions:

table <- Hash.new eqFunc hashFunc
let insert_no_dups value = do
      let key = keyFunc value
      Hash.insert table key value
mapM_ insert_no_dups main_list
let select_file arcfile = do
      diskfile <- Hash.lookup table (keyFunc arcfile)
      return (diskfile == Nothing)
-- keep only the items whose key was not found in the table
-- (filterM, since select_file returns a Bool)
result_list <- filterM select_file checked_list

This code removes from checked_list the items present in main_list.

The second technique is merging sorted lists:

merge (sortOn keyFunc main_list) (sortOn keyFunc checked_list) ...

I was so interested in solving this problem that I've started my own program that does the trick. I attached the first part - it manages the Map but uses an extremely simple heuristic to find the closest match. I've tried it on the program source itself - it was printed exactly :) You can try it yourself - the 'database' file should contain the Right Titles, one per line, while the 'new' file should contain the titles to check.

Now I will work on the edit-distance algorithm. I have an idea of checking the real distance between chars as on my keyboard, so for example "sprry" can be a mistake for "sorry", but not for "serry".

For Real Performance, you should replace String with ByteString and probably Map with Hash, but not before the algorithm itself is finished. Also, your approach of splitting each line into fields is bad (unless you have plans for the other fields) - just using an index into a global list of full lines (as I do) will be enough.

It's also interesting to think about the sort-and-merge approach, but at first look it seems unobvious how it can be used here.

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com
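[A sketch of that sort-and-merge join, with hypothetical names - sortOn is in today's Data.List, and the EQ case assumes keys are unique on each side:]

import Data.List (sortOn)

mergeJoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
mergeJoin ka kb as bs = go (sortOn ka as) (sortOn kb bs)
  where
    go [] _  = []
    go _  [] = []
    go xs@(x : xs') ys@(y : ys') =
      case compare (ka x) (kb y) of
        LT -> go xs' ys           -- x has no partner
        GT -> go xs  ys'          -- y has no partner
        EQ -> (x, y) : go xs' ys'

[Sorting dominates at O(n log n); the merge itself is linear.]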

From: Bulat Ziganshin
Sent: 28 September 2006 17:29
To: Lyle Kopnicky
Now I will work on the edit-distance algorithm. I have an idea of checking the real distance between chars as on my keyboard, so for example "sprry" can be a mistake for "sorry", but not for "serry".
For English text, the Soundex algorithm (or one of its variations) might do a good job of matching:

http://en.wikipedia.org/wiki/Soundex

Google doesn't show any Haskell implementations yet.

Alistair
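[For the record, a minimal American Soundex sketch in Haskell - simplified in that it ignores the h/w separator rule, so e.g. "Ashcraft" comes out A226 rather than the official A261:]

import Data.Char (isAlpha, toUpper)

-- Soundex digit for a letter; '0' marks vowels and other ignored letters.
code :: Char -> Char
code c = case toUpper c of
  x | x `elem` "BFPV"     -> '1'
    | x `elem` "CGJKQSXZ" -> '2'
    | x `elem` "DT"       -> '3'
    | x == 'L'            -> '4'
    | x `elem` "MN"       -> '5'
    | x == 'R'            -> '6'
    | otherwise           -> '0'

-- soundex "Robert" == "R163", soundex "Tymczak" == "T522"
soundex :: String -> String
soundex name = case filter isAlpha name of
  []       -> ""
  (c : cs) ->
    let codes  = collapse (map code (c : cs))   -- merge adjacent duplicates
        digits = filter (/= '0') (drop 1 codes) -- drop the first letter's code
    in take 4 (toUpper c : digits ++ repeat '0')
  where
    collapse (x : y : zs) | x == y = collapse (y : zs)
    collapse (x : zs)              = x : collapse zs
    collapse []                    = []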
participants (8)
- Bayley, Alistair
- Bertram Felgenhauer
- Brandon Moore
- Bulat Ziganshin
- Ketil Malde
- Lemmih
- Lyle Kopnicky
- Robert Dockins