
Hello Lyle,

Wednesday, September 27, 2006, 12:44:05 AM, you wrote:
> It's supposed to match movie titles from an imported database to a reference database.
> The import files each have 3,000 records, and the reference table has 137,986 records.
> Building the hash tables out of the files is quick - it just takes a few seconds. But doing the matching of title_id in one table to title_id in the other, in a nested loop between both tables, takes way too long.
Something not unexpected :) I solve a similar problem in my program: when it updates an archive, it has to join two file lists while avoiding all duplicates, and it processes 250k-entry lists in a few seconds ;) Basically it uses two techniques.

First, use a Hash to filter out repetitions:

    table <- Hash.new eqFunc hashFunc
    let insert_no_dups value = do
            let key = keyFunc value
            Hash.insert table key value
    mapM_ insert_no_dups main_list
    let select_file arcfile = do
            diskfile <- Hash.lookup table (keyFunc arcfile)
            return (diskfile == Nothing)
    result_list <- filterM select_file checked_list

This code removes from checked_list the items present in main_list. (Since select_file returns a Bool, filterM is the combinator to use here, and mapM_ avoids collecting the unused results.)

The second technique is merging sorted lists:

    merge (sortOn keyFunc main_list) (sortOn keyFunc checked_list)

...

I was so interested in solving this problem that I've started my own program that does the trick. I attached the first part - the one that manages the Map but uses an extremely simple heuristic to find the closest match. I've tried it on the program source itself - it was printed back exactly :) You can try it yourself: the 'database' file should contain the Right Titles, one per line, while the 'new' file should contain the titles to check.

Now I will work on the edit-distance algorithm. I have an idea of weighting the real distance between characters as they sit on my keyboard, so for example "sprry" can be a mistake for "sorry", but not for "serry".

For Real Performance you should replace String with ByteString, and probably Map with Hash, but not before the algorithm itself is finished. Also, your approach of splitting each line into fields is bad (unless you have plans for the other fields) - just an index into a global list of full lines (as I do) will be enough.

It's also interesting to think about the sort-and-merge approach, but at first look it is not obvious how it can be used here.

-- 
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com
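P.S. The Hash module above is from my own code; as a self-contained sketch of the same duplicate-filtering technique using the standard Data.Map from containers (removeDups and the lists here are purely illustrative names, not Lyle's data):

```haskell
import qualified Data.Map as Map

-- Build a map keyed by keyFunc over the main list, then keep only
-- those checked items whose key does not appear in the map.
removeDups :: Ord k => (a -> k) -> [a] -> [a] -> [a]
removeDups keyFunc main_list checked_list =
    let table = Map.fromList [(keyFunc v, v) | v <- main_list]
    in  filter (\x -> Map.notMember (keyFunc x) table) checked_list

main :: IO ()
main = print (removeDups id [1, 2, 3 :: Int] [2, 3, 4, 5])  -- prints [4,5]
```

The pure Map version avoids IO entirely, which is enough until profiling says otherwise.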
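P.P.S. The merge in the one-liner above is also from my own code; a minimal self-contained version that merges two key-sorted lists and drops checked items whose key already occurs in the main list might look like this (mergeNoDups is an illustrative name):

```haskell
import Data.List (sortOn)

-- Sort both lists on the key, then walk them in lockstep; on a
-- duplicate key, keep the copy from the main (first) list.
mergeNoDups :: Ord k => (a -> k) -> [a] -> [a] -> [a]
mergeNoDups key xs ys = go (sortOn key xs) (sortOn key ys)
  where
    go as []           = as
    go [] bs           = bs
    go (a:as) (b:bs)
      | key a < key b  = a : go as (b:bs)
      | key a > key b  = b : go (a:as) bs
      | otherwise      = a : go as bs      -- duplicate key: main copy wins

main :: IO ()
main = print (mergeNoDups id [1, 3, 5 :: Int] [2, 3, 4])  -- prints [1,2,3,4,5]
```

After the two sorts this is a single linear pass, which is why sort-and-merge competes with hashing on big lists.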
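P.P.P.S. For the edit-distance part, here is a plain textbook Levenshtein sketch to start from (this is not the attached code); the keyboard-aware variant would just replace the constant substitution cost 1 with a per-character-pair cost:

```haskell
-- Classic dynamic-programming Levenshtein distance, computed one
-- row at a time with foldl/scanl so only O(length xs) memory is used.
editDistance :: Eq a => [a] -> [a] -> Int
editDistance xs ys = last (foldl transform [0 .. length xs] ys)
  where
    transform row@(h:_) y = scanl compute (h + 1) (zip3 xs row (tail row))
      where
        -- z: cell to the left; d: diagonal; l: cell above
        compute z (x, d, l) =
          minimum [z + 1, l + 1, d + if x == y then 0 else 1]
    transform [] _ = []

main :: IO ()
main = print (editDistance "sorry" "sprry")  -- prints 1
```

With a cost table, "sprry" -> "sorry" (o and p are adjacent keys) would score lower than "sprry" -> "serry".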