I have a large database with a list of institutions (universities, hospitals, etc.). The names come from different sources, so the same institution can be spelled in different ways: names can be misspelled, for example, or words can be shortened ("uni", "univ", or "university").
Given a name that I need to insert into the database, is there a practical way to find out whether this institution is already there? This is not a research project, so I am looking for a solution that is reasonably fast.
I am using Django and PostgreSQL, but I suppose that does not matter.
This is the problem of record linkage. Many databases provide basic methods for it, such as character-level n-gram matching, where a term like "university" is expanded into
["uni", "niv", "ive", "ver", "ers", ...]
for n = 3. The database indexes all such n-grams and allows searching with some kind of weighted matching. PostgreSQL's pg_trgm extension seems to do exactly this; try it out.
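To make the idea concrete, here is a minimal pure-Python sketch of trigram matching. It is only an illustration of the technique, not how pg_trgm is implemented: the word padding is loosely modeled on what pg_trgm does, the score is a plain Jaccard ratio, and the function names and the 0.3 threshold are my own assumptions.

```python
def trigrams(text, n=3):
    """Return the set of character n-grams of every word in text."""
    grams = set()
    for word in text.lower().split():
        # Pad each word so its beginning and end yield their own n-grams
        # (assumption: n-1 leading spaces and one trailing space).
        padded = " " * (n - 1) + word + " "
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def similarity(a, b):
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_matches(name, existing, threshold=0.3):
    """Return (score, candidate) pairs at or above threshold, best first."""
    scored = [(similarity(name, other), other) for other in existing]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


if __name__ == "__main__":
    names = ["University of Oxford", "General Hospital"]
    # Hypothetical data; the abbreviated name should match only the first entry.
    print(best_matches("Univ of Oxford", names))
```

Scanning every row like this is too slow for a large table, which is why you would let the database do it with an index over the n-grams instead.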
You should probably look into using a dedicated search engine. Django-haystack enables you to easily add search engines like Solr, Whoosh or Xapian to your project.