Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.

Question: which of these measures would be most relevant to detect two different persons names which are actually the same? (different because of a mispelling). The trick is that it should minimize false positives. Example:

Obaama Obama => should probably be merged

Obama Ibama => should not be merged.

This is just an oversimple example. Are their programmers and computer scientists who worked out this issue in more detail?

like image 629
seinecle Avatar asked Aug 12 '12 07:08

seinecle


2 Answers

I can suggest an information-retrieval technique of doing so, but it requires a large collection of documents in order to work properly.

Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.

Once you get a name (Obaama for example): retrieve the set of collections the word Obaama appears in. Let this set be D1.
Now, for each word w in D11search for Obaama AND w (using your IR system). Let the set be D2.

The score |D2|/|D1| is an estimation how much w is connected to Obaama, and most likely will be close to 1 for w=Obama2.
You can manually label a set of examples and find the value from which words will be expected.

Using a standard lexicographical similarity technique you can chose to filter out words that are definetly not spelling mistakes (Like Barack).

Another solution that is often used requires a query log - find a correlation between searched words, if obaama has correlation with obama in the query log - they are connected.


1: You can improve performance by first doing the 2nd filter, and check only for candidates who are "similar enough" lexicographically.

2: Usually a normalization is also used, because more frequent words are more likely to be in the same documents with any word, regardless of being related or not.

like image 149
amit Avatar answered Sep 19 '22 20:09

amit


You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

like image 30
Kenston Choi Avatar answered Sep 18 '22 20:09

Kenston Choi