Approximate string matching

Question

I know this question have been asked a lot of time. I want a suggestion on which algorithm is suitable for approximate string matching.

The application is specifically for company name matching only and nothing else.

The biggest challenge is probably the company end name part and short named part Example: 1. companyA pty ltd vs companyA pty. ltd. vs companyA 2. WES Engineering vs W.E.S. Engineering (extremely rare occurance)

Do you think Levenshtein Edit Distance is adequate?

I'm using C#

Regards, Max

hashable · Accepted Answer

There are various string distance metrics you could use.

I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.

I haven't worked with C# but here are some implementations of JW I found online:

Impl 1 (They have a DOT NET version too if you look at the file list)

Impl 2

If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute

distance (normalize("foo corp."), normalize("FOO CORPORATION") )

you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).

Peter Alexander · Answer

Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.

You could also possibly use Soundex, but I don't think you'll need it.

Approximate string matching

Tags:

string

c#

matching

approximate

Max

2 Answers

hashable

Peter Alexander

Recent Activity

Donate For Us

Approximate string matching

Tags:

string

c#

matching

approximate

Max

2 Answers

hashable

Peter Alexander

Related questions

Recent Activity

Donate For Us