Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word comparison algorithm

I am doing a CSV Import tool for the project I'm working on. The client needs to be able to enter the data in excel, export them as CSV and upload them to the database. For example I have this CSV record:

   1,   John Doe,     ACME Comapny   (the typo is on purpose)

Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. the comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:

"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very big difference index Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different. Also, "Acme Company" and "Company Acme" should return 0.

So if the client makes a type while entering data, i could prompt him to choose the name he most probably wanted to insert.

Is there a known algorithm to do this, or maybe we can invent one :) ?

like image 400
disc0dancer Avatar asked Jan 23 '09 16:01

disc0dancer


2 Answers

You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.

This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.

like image 117
MattK Avatar answered Nov 09 '22 08:11

MattK


I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:

levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.

like image 26
Phantom Watson Avatar answered Nov 09 '22 08:11

Phantom Watson