
Hi folks,

I'm competing in a contest at work, and we're allowed to use whatever language we want. I decided this was my chance to prove to people that Haskell was up to the challenge. Unfortunately, I ran into performance problems. Since the contest ends this Friday, I've decided to switch to C++ (gasp!). But if any of you have advice on how to speed up this code, it could help me advocate Haskell in the future.

It's supposed to match movie titles from an imported database to a reference database. The version I've sent doesn't do anything very smart - it's just doing literal title matches. The first argument to the program is the filename of the table to be imported, and the second is the filename of the reference table. The first line of each table is a pipe-separated list of field names; the rest of the lines are records, each a pipe-separated list of values.

The import files each have 3,000 records, and the reference table has 137,986 records.

Building the hash tables out of the files is quick - it just takes a few seconds. But matching title_id in one table to title_id in the other, in a nested loop over both tables, takes way too long. It's matching two import titles (against each of the reference titles) per second. It needs to do at least 20 per second to qualify for the contest, and it's not doing anything fancy yet.

I tried various "improvements" to speed it up. One was to use ByteString directly, eliminating the AbsString class. That didn't make a difference. Another was to use arrays instead of lists to store each record, and to precompute the indices of the fields within those records. I also iterated over a list of keys instead of the list of Maps, and only converted each record to a Map one at a time, hoping they would be disposed of sooner. Instead of speeding up the program, this slowed it down by a factor of 20!

I've profiled it, but I can't make much out of the results. It seemed to be spending 25% of its time doing scoring, and I thought the problem must be due to laziness, but I'm not sure.

So if anyone has any ideas how to speed this up by a factor of at least 10, it would be really appreciated! Even the Ruby solutions are doing that, which is embarrassing.

Thanks,
Lyle
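[For concreteness, here is a minimal sketch of reading the file format described above - not the actual contest code, just plain String I/O with hypothetical names:]

-- Split a line on '|', preserving empty fields.
splitPipe :: String -> [String]
splitPipe s = case break (== '|') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitPipe rest

-- A table is its header (the field names) plus one field list per record.
parseTable :: String -> ([String], [[String]])
parseTable contents = case map splitPipe (lines contents) of
  header : records -> (header, records)
  []               -> ([], [])

main :: IO ()
main = do
  text <- readFile "import.txt"   -- hypothetical filename
  let (fields, records) = parseTable text
  print fields
  print (length records)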

On 9/26/06, Lyle Kopnicky wrote:
[snip]
Do you have some test input online?

-- Cheers, Lemmih

Lemmih wrote:
Do you have some test input online?
I've attached some (very short) input files. Sorry I can't provide more - they're proprietary databases. I know that means you can't actually test the performance, but can only give me advice. At least you can run the program on them, and there is one match.

- Lyle

title_id|ref_title_id|importance|title|studio|run_time|rating|theatrical_release|home_video_release|box_office_gross|synopsis|actors|director
26246|157067|2|Sniper|TRI||R|29-JAN-93|04-AUG-93|18994653|A rebel leader and a wealthy drug lord are pursued in Panama by a marine sergeant and Oympic marksman.|Dale Dye;Aden Young;Tom Berenger;J.T. Walsh;Richard Lineback;Billy Zane|Luis Llosa
68308|174883|2|Northern Passage||97|||27-FEB-96|0|In the wilds of North America in the 1800s, Paul a young zoologist has pledged to keep as eye on his longtime friend's beautiful daughter, Nepeese. Nepeese and Paul fall in love, but their relationship is interrupted when an abusive fur trader kidnaps Nepeese. Starring Jeff Fahey.|Jeff Fahey;Neve Campbell|
68591|196655|2|Old Man And The Sea (1990)||97|||01-JUL-96||An old man heads out to sea, hooks a large fish, and battles a shark who has taken his catch.|Anthony Quinn;Alexis Cruz|
51506|190181|2|Shoot Out||95|||27-APR-99|||Gregory Peck;Robert F. Lyons|

title_id|importance|title|studio|runtime|rating|theatrical_release|home_video_release|theatrical_box_office_gross|actors|director
214063|2|Elvis: His Best Friend Remembers|UNI||NR||30-JUL-02||Elvis Presley|
133868|2|Mighty Hercules, The - Speed Racer - V. 5||30|NR|||||
174883|2|Northern Passage|VMM|97|PG-13||18-JAN-00||Jeff Fahey;Neve Campbell;Jacques Weber;Lorne Brass;Genevieve Rochette|Arnaud Selignac
121430|2|Appointment With Death|WAR|||15-APR-88||960040|Peter Ustinov;Lauren Bacall;Carrie Fisher;Piper Laurie;John Gielgud;Hayley Mills;Jenny Seagrove;David Soul|Michael Winner

On Tuesday 26 September 2006 16:44, Lyle Kopnicky wrote:
[snip]
Building the hash tables out of the files is quick - it just takes a few seconds. But doing the matching of title_id in one table to title_id in the other, in a nested loop between both tables, takes way too long. It's matching two import titles (against each of the reference titles) per second. It needs to do at least 20 per second to qualify for the contest, and it's not doing anything fancy yet.
Humm... well, a double nested loop seems like the wrong approach. Also, if you are using GHC, its hashtable implementation has fairly well-known performance problems.

If all you care about is exact matching, then the operation is essentially a finite map intersection (if I've understood what you are trying to do). This is just a guess, but I suspect you will probably get much better performance (and better-looking code!) by just using Data.Map.intersection:

http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3...

Alternately, there is the ternary trie implementation from Edison (http://www.eecs.tufts.edu/~rdocki01) that may also work for you.

If you need to do prefix matching, then a trie is the way to go. You can probably code up a nice prefix-intersection operation using tries that should go pretty fast.

If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
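[A minimal sketch of the intersection idea, assuming the tables are keyed by the literal title string - the field choice and type names here are hypothetical:]

import qualified Data.Map as Map

-- title -> title_id, one map per table, built once from the parsed records.
type TitleMap = Map.Map String String

-- Exact matches are exactly the keys present in both maps.
exactMatches :: TitleMap -> TitleMap -> Map.Map String (String, String)
exactMatches imported reference = Map.intersectionWith (,) imported reference

[Building each map is O(n log n), and the intersection then replaces the 3,000 x 137,986 nested loop entirely.]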
-- Rob Dockins

Talk softly and drive a Sherman tank.
Laugh hard, it's a long way to the bank.
-- TMBG

Robert Dockins wrote:
Humm... well, a double nested loop seems like the wrong approach.

It may be. I had hoped it would be fast enough that way.

Also, if you are using GHC, its hashtable implementation has fairly well-known performance problems. If all you care about is exact matching, then the operation is essentially a finite map intersection (if I've understood what you are trying to do).
No, I don't just care about exact matching. I care about very fuzzy matching. It's just that the code I've implemented so far only does exact matching on the title strings. It's a first step.
This is just a guess, but I suspect you will probably get much better performance (and better-looking code!) by just using Data.Map.intersection
http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3...
Thanks, but that won't help with the fuzzy matching.
Alternately, there is the ternary trie implementation from Edison (http://www.eecs.tufts.edu/~rdocki01) that may also work for you.
That may be useful.
If you need to do prefix matching, then a trie is the way to go. You can probably code up a nice prefix-intersection operation using tries that should go pretty fast.
If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.

Thanks,
Lyle
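[For reference, a sketch of the standard dynamic-programming edit distance (Levenshtein) in Haskell, keeping only one row of the DP table in memory. A production version would use unboxed arrays and ByteString:]

editDistance :: String -> String -> Int
editDistance s t = last (foldl nextRow [0 .. length t] s)
  where
    -- row !! j is the distance from the prefix of s consumed so far
    -- to the first j characters of t; rows are never empty.
    nextRow row@(d : ds) c = scanl step (d + 1) (zip3 t row ds)
      where
        step left (tc, diag, up) =
          minimum [ left + 1                           -- insertion
                  , up + 1                             -- deletion
                  , diag + if c == tc then 0 else 1 ]  -- substitution

[For example, editDistance "kitten" "sitting" is 3.]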

Lyle Kopnicky writes:
If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).

For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).

-k

-- If I haven't seen further, it is by standing in the footprints of giants
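[A sketch of such a word index using Data.Map, with hypothetical names. Looking up each word of a query title yields the candidate reference titles worth scoring exhaustively:]

import qualified Data.Map as Map
import Data.Char (isAlphaNum, toLower)

type WordIndex = Map.Map String [String]   -- word -> titles containing it

-- Lower-case a title and break it into words.
titleWords :: String -> [String]
titleWords = words . map (\c -> if isAlphaNum c then toLower c else ' ')

-- Build the index once over all reference titles.
buildIndex :: [String] -> WordIndex
buildIndex titles =
  Map.fromListWith (++) [ (w, [t]) | t <- titles, w <- titleWords t ]

-- Titles sharing at least one word with the query (may contain duplicates).
candidates :: WordIndex -> String -> [String]
candidates ix query =
  concat [ Map.findWithDefault [] w ix | w <- titleWords query ]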

Ketil Malde wrote:
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).
For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).
Gotcha. That's exactly the approach I've switched to. It is possible to miss titles if words are misspelled, but it's unlikely that all words in a title will be misspelled, so you can at least narrow your search to titles that have at least one matching (non-trivial) word.

- Lyle

Ketil Malde wrote:
Lyle Kopnicky writes:

If you have some metric other than prefix in mind for partial matches, then things probably get a lot more complicated. You're probably looking at calculating minimum distances in some feature-space, which calls for pretty sophisticated algorithms if you need good performance.
Yes, that's the kind of thing I'm looking at doing. Looking at edit distance.
Do you really need that to search for movie titles? At any rate, an exact-match finite-map implementation is a good start - to get good performance, you probably will need to use some kind of index to reduce the amount of data to search exhaustively (all-against-all).
For text searching I think it is effective to use an index that maps from words (so that looking up a word gives you all the movies with that word in the title).
-k
If you really need to do edit distance, perhaps something like "Low Distortion Embeddings for Edit Distance" would help:

http://www.cs.ucla.edu/~rafail/PUBLIC/68.pdf

This is just one of the first results from a search; there are probably plenty of things out there. Make up for any constant factors from your language choice with advanced algorithms! (And any constant is getting pretty small these days.)

Brandon

Lyle Kopnicky wrote: [snip]
data TextTable s = TextTable
  { tableFields   :: ![s]
  , keyFieldIndex :: !Int
  , tableRecords  :: !(HashTable s (Array Int s))
  }
[snip]
listRecords :: AbsString s => TextTable s -> IO [TextRecord s]
listRecords (TextTable fields _ records) = do
  keyRecs <- HT.toList records
  return $ map (fromList . zip fields . elems . snd) keyRecs
Doing fromList again and again can't be good. Why don't you make tableFields a map that maps names to array indices? Then you can just pass the bare arrays along, and the later lookups will be cheaper, too.

Now due to laziness this will probably be evaluated in matchScore, because before that the resulting Map isn't used - which is exactly where you said a lot (most?) of the time is spent.

Another thing is that you should compile your code with -O, but I guess you are already doing that.

Hope this helps,
Bertram
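[A sketch of that suggestion, with hypothetical types - the name-to-index map is built once per table, and the records stay bare arrays:]

import qualified Data.Map as Map
import Data.Array (Array, (!))

-- Built once from the header line.
fieldIndices :: Ord s => [s] -> Map.Map s Int
fieldIndices fields = Map.fromList (zip fields [0 ..])

-- One shared Map lookup plus an O(1) array index per field access.
fieldValue :: Ord s => Map.Map s Int -> Array Int s -> s -> Maybe s
fieldValue ix record name = fmap (record !) (Map.lookup name ix)

[Better still, the index of a hot field like the title can be looked up once outside the matching loop: given i, every access is just record ! i.]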

Bertram Felgenhauer wrote:
Lyle Kopnicky wrote:
[snip]
listRecords :: AbsString s => TextTable s -> IO [TextRecord s]
listRecords (TextTable fields _ records) = do
  keyRecs <- HT.toList records
  return $ map (fromList . zip fields . elems . snd) keyRecs
Doing fromList again and again can't be good. Why don't you make tableFields a map that maps names to array indices? Then you can just pass the bare arrays along, and the later lookups will be cheaper, too.
That might make a difference. It does spoil the interface a bit, since now the caller has to look up a field name to get an index, then use that to look up a value, instead of just using the field name to get the value.
Now due to laziness this will probably be evaluated in matchScore, because before that the resulting Map isn't used - which is exactly where you said a lot (most?) of the time is spent.
Yes, likely. I had to run on a small pair of files in order to get the profiling to work. So, probably more time was spent in matchScore than it admitted. (The overhead of the initial read would decrease as the table size increases.)
Another thing is that you should compile your code with -O, but I guess you are already doing that.
Yep. Thanks.

- Lyle
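[If laziness is indeed the culprit, one hedged experiment (a sketch with hypothetical names) is to force each record's Map as it is produced, so the construction cost is paid during table loading rather than inside matchScore. Forcing a Data.Map to WHNF builds its whole spine, though the values themselves stay lazy:]

import qualified Data.Map as Map

-- Force each record's Map as its list cell is produced.
strictRecords :: Ord k => [[(k, v)]] -> [Map.Map k v]
strictRecords = foldr force []
  where
    force kvs acc = let m = Map.fromList kvs
                    in m `seq` (m : acc)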

Hello Lyle, Wednesday, September 27, 2006, 12:44:05 AM, you wrote:
It's supposed to match movie titles from an imported database to a reference database.
The import files each have 3,000 records, and the reference table has 137,986 records.
Building the hash tables out of the files is quick - it just takes a few seconds. But doing the matching of title_id in one table to title_id in the other, in a nested loop between both tables, takes way too long.
Something not unexpected :) I solve a similar problem in my program - when it updates an archive, it has to join two filelists, avoiding all the duplicates. And it processes 250k lists in a few seconds ;)

Basically it uses two techniques. First, use a Hash to filter out repetitions:

table <- Hash.new eqFunc hashFunc
let insert_no_dups value = do
      let key = keyFunc value
      Hash.insert table key value
mapM_ insert_no_dups main_list
let select_file arcfile = do
      diskfile <- Hash.lookup table (keyFunc arcfile)
      return (diskfile == Nothing)
-- keep only the items whose key was not found in the table
-- (filterM, since select_file returns a Bool)
result_list <- filterM select_file checked_list

This code removes from checked_list the items present in main_list.

The second technique is merging sorted lists:

merge (sortOn keyFunc main_list) (sortOn keyFunc checked_list) ...

I was so interested in solving this problem that I've started my own program that does the trick. I attached the first part - it manages the Map but uses an extremely simple heuristic to find the closest match. I've tried it on the program source itself - it was printed exactly :) You can try it yourself - the 'database' file should contain the Right Titles, one per line, while the 'new' file should contain the titles to check.

Now I will work on the edit-distance algorithm. I have an idea of checking the real distance between chars as on my keyboard, so for example "sprry" can be a mistake for "sorry", but not for "serry".

For Real Performance, you should replace String with ByteString and probably Map with Hash, but not before the algorithm itself is finished. Also, your approach of splitting each line into fields is bad (unless you have plans for the other fields) - just using an index into a global list of full lines (as I do) will be enough.

It's also interesting to think about the sort-and-merge approach, but at first look it seems unobvious how it can be used here.

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com
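[A sketch of that sort-and-merge join, with hypothetical names - sortOn is in today's Data.List, and the EQ case assumes keys are unique on each side:]

import Data.List (sortOn)

mergeJoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
mergeJoin ka kb as bs = go (sortOn ka as) (sortOn kb bs)
  where
    go [] _  = []
    go _  [] = []
    go xs@(x : xs') ys@(y : ys') =
      case compare (ka x) (kb y) of
        LT -> go xs' ys           -- x has no partner
        GT -> go xs  ys'          -- y has no partner
        EQ -> (x, y) : go xs' ys'

[Sorting dominates at O(n log n); the merge itself is linear.]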

From: Bulat Ziganshin
Sent: 28 September 2006 17:29
To: Lyle Kopnicky
Now I will work on the edit-distance algorithm. I have an idea of checking the real distance between chars as on my keyboard, so for example "sprry" can be a mistake for "sorry", but not for "serry".
For English text, the Soundex algorithm (or one of its variations) might do a good job of matching:

http://en.wikipedia.org/wiki/Soundex

Google doesn't show any Haskell implementations yet.

Alistair
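[For the record, a minimal American Soundex sketch in Haskell - simplified in that it ignores the h/w separator rule, so e.g. "Ashcraft" comes out A226 rather than the official A261:]

import Data.Char (isAlpha, toUpper)

-- Soundex digit for a letter; '0' marks vowels and other ignored letters.
code :: Char -> Char
code c = case toUpper c of
  x | x `elem` "BFPV"     -> '1'
    | x `elem` "CGJKQSXZ" -> '2'
    | x `elem` "DT"       -> '3'
    | x == 'L'            -> '4'
    | x `elem` "MN"       -> '5'
    | x == 'R'            -> '6'
    | otherwise           -> '0'

-- soundex "Robert" == "R163", soundex "Tymczak" == "T522"
soundex :: String -> String
soundex name = case filter isAlpha name of
  []       -> ""
  (c : cs) ->
    let codes  = collapse (map code (c : cs))   -- merge adjacent duplicates
        digits = filter (/= '0') (drop 1 codes) -- drop the first letter's code
    in take 4 (toUpper c : digits ++ repeat '0')
  where
    collapse (x : y : zs) | x == y = collapse (y : zs)
    collapse (x : zs)              = x : collapse zs
    collapse []                    = []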
participants (8)
- Bayley, Alistair
- Bertram Felgenhauer
- Brandon Moore
- Bulat Ziganshin
- Ketil Malde
- Lemmih
- Lyle Kopnicky
- Robert Dockins