
Matching inexact company names in Java

Tags: java, matching

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.

For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".

Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.

The number of companies is small enough that I can maintain a map of their names in memory, i.e. I have the option of using Java rather than SQL to find the right name.

How would you do this in Java?

Sophie Gage asked Nov 27 '08 01:11


4 Answers

You could standardize the format of the names in your DB/map and in the input as far as possible (e.g. convert everything to a single case), then use the Levenshtein (edit) distance, a standard dynamic-programming metric, to score the input against all your known names.
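
A minimal sketch of that idea, assuming plain lower-casing as the only standardization step. The class name `LevenshteinMatcher`, the method `bestMatch`, and the `maxDistance` cut-off are hypothetical names chosen for illustration, not part of any library:

```java
import java.util.List;
import java.util.Locale;

final class LevenshteinMatcher {

    // Edit distance via the usual dynamic-programming recurrence,
    // keeping only the previous and current rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known name closest to the input (compared case-insensitively),
    // or null when even the closest one is more than maxDistance edits away.
    static String bestMatch(String input, List<String> knownNames, int maxDistance) {
        String needle = input.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String name : knownNames) {
            int d = distance(needle, name.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = name;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }
}
```

Returning null above the cut-off is one way to avoid forcing a false match; a suitable value for `maxDistance` would have to be tuned against real data.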

You could then have the user confirm the match and, if they don't like it, give them the option to enter that value into your list of known names (on second thought, that might be too much power to give a user...)

Drew Hall answered Sep 24 '22 06:09


Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:

https://code.google.com/p/java-similarities/

If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. About 20 different algorithms are already implemented (including Levenshtein, Jaro-Winkler, and Monge-Elkan), and the code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.

(BTW, I'm not the author of the library, so kudos to its creators.)

Zsolt Katona answered Sep 23 '22 06:09


You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

  • LCS code
  • Example usage (guessing a category based on what people entered)
Dustin answered Sep 21 '22 06:09


I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
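
The "ignoring" part could be done as a normalization pass before any scoring. A rough sketch follows; the `NOISE` word list is a hypothetical starting point (treating "and" as noise is an assumption), and `CompanyNameNormalizer` is an illustrative name:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CompanyNameNormalizer {

    // Hypothetical list of low-value words; extend it to suit your own data.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "co", "company", "ltd", "limited", "llc", "inc", "incorporated",
            "corp", "and"));

    static String normalize(String name) {
        // Lower-case, then collapse every run of characters that are neither
        // letters nor digits (punctuation, whitespace) into a single space.
        String cleaned = name.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        StringBuilder key = new StringBuilder();
        for (String token : cleaned.split(" ")) {
            if (!NOISE.contains(token)) {
                key.append(token); // joining without a separator drops the spaces too
            }
        }
        return key.toString();
    }
}
```

With this word list, all four spellings from the question ("A. B. Widgets & Co Ltd.", "AB Widgets Limited", "A.B. Widgets and Co", "A B Widgets") normalize to `abwidgets`. The LCS comparison then only has to absorb genuine spelling differences.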

Adam Jaskiewicz answered Sep 21 '22 06:09