
Approximate search in a database

I have a large database with a list of institutions (universities, hospitals, etc.). The names come from different sources, so the same institution can be spelled in different ways. A name can be misspelled, for example, or words can be shortened ("uni", "univ", or "university").

Given a name that I need to insert into the database, is there a practical way to find out whether this institution is already there? This is not a research project, so I am looking for a solution that is reasonably fast.

I am using Django and PostgreSQL, but I suppose that does not matter.

akonsu asked Oct 12 '11 13:10


2 Answers

This is the problem of record linkage. Many databases provide basic methods for this, such as character-level n-gram matching, where a term like "university" is expanded into

["uni", "niv", "ive", "ver", "ers", ...]

for n = 3. The database would index all such n-grams and allow a search with some kind of weighted matching. PostgreSQL's pg_trgm extension seems to do exactly this, so try it out.
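To make the idea concrete, here is a minimal pure-Python sketch of trigram matching. It is not pg_trgm's actual implementation: the function names (`trigrams`, `similarity`, `best_matches`), the word padding, and the 0.3 threshold are assumptions chosen for illustration, and the score is a plain Jaccard ratio (shared trigrams divided by all distinct trigrams).

```python
import re


def trigrams(text):
    """Return the set of character trigrams for every word in text.

    Each word is lowercased and padded with two leading spaces and one
    trailing space (an assumption of this sketch), so that word
    boundaries produce trigrams of their own.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_matches(name, candidates, threshold=0.3):
    """Return (score, candidate) pairs at or above threshold, best first."""
    scored = [(similarity(name, c), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


if __name__ == "__main__":
    known = ["University of Oxford", "Boston General Hospital"]
    for score, candidate in best_matches("Univ of Oxford", known):
        print(f"{score:.2f}  {candidate}")
```

Comparing every new name against every row like this is linear in the table size, so for a large table you would let the database do the work with an index over the trigrams, which is what pg_trgm is for. In the database the matching also happens before the insert, so a shortened name such as "Univ of Oxford" can be flagged as a likely duplicate of "University of Oxford".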

Fred Foo answered Nov 15 '22 23:11


You should probably look into using a dedicated search engine. Django-haystack enables you to easily add search engines like Solr, Whoosh or Xapian to your project.

Bernhard Vallant answered Nov 15 '22 23:11