Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spelling correction for person names (Python)

I have a large collection of person names (e.g. "john smith"). I want to look up people by name in it. However, some of the queries will be misspelled (e.g. "jon smth", "johnsm ith"). Are there any spelling correction libraries with Python bindings that might find spelling-corrected matches for me?

I'm aware of Whoosh and Python-aspell. Whoosh's spelling correction doesn't quite work for me because it writes the collection of correct spellings to disk rather than storing it in memory. That makes lookups too slow for my application. It doesn't seem to be trivial to change this behavior, because of how the code is structured. Also Whoosh does not weight different character-edits differently even though, say, a 'y' is much more likely to be confused with an 'i' ("jim kazinsky" -> "jim kazinski") than it is a 'z'.

Aspell doesn't work well with person names, since names typically contain white space -- Aspell considers the word to be the fundamental unit of correction. Also, as I understand it, Aspell uses an n-gram model of spelling correction, rather than a character-edit distance model. While an n-gram model makes sense for dictionary words, it doesn't work as well for names: the people "bob ruzatoxg" and "joe ruzatoxg" have a lot of rare trigrams in common, since they have the same rare last name. But they're clearly different people.

I should also mention that I can't just compare each query to all of the entries in the collection -- that would be too slow. Some index needs to get built beforehand.

Thanks!

like image 858
Jeff Avatar asked Dec 03 '12 19:12

Jeff


1 Answers

Sounds like (no pun intended there) you need some form of the Metaphone algorithm, which has been implemented in Python and is available on Pypi: http://pypi.python.org/pypi/Metaphone/0.4.

There are other algorithms too, such as Levenshtein and Soundex (haven't yet found a reliable Python implementation of this) - you might want to calculate some form of metric using more than one of these (perhaps even give different weighting to each result) to come up with a results list of likely corrections.

like image 157
ChrisW Avatar answered Oct 14 '22 07:10

ChrisW