Is there a way to filter a Django queryset based on string similarity (a la Python's difflib)?

I have a need to match cold leads against a database of our clients.

The leads come from a third-party provider in bulk (thousands of records), and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client.

Obviously, there are misspellings and name variants in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't just do a filter comparing lead_first_name to client_first_name, etc.

I need to use some sort of string similarity mechanism.

Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.

I know that most SQL databases have soundex and difference functions. See my test of them in the edit below; they don't work as well as difflib.

Is there another solution? Is there a better solution?

Edit:

Soundex, at least in my db, doesn't behave as well as difflib.

Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":

-- Sample "clients" table, built inline for the test.
with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
-- difference() compares soundex codes: 0 (no similarity) to 4 (strongest).
select first_name, last_name
  from temp
 where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
 order by difference(first_name+' '+last_name, 'Joe Lopes')

The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.

But difflib does a much better job:

>>> import difflib
>>> difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']
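
For reference, get_close_matches ranks candidates by SequenceMatcher.ratio() and by default keeps the best 3 that score at least 0.6. A minimal sketch that exposes those scores for the same test data (the helper name score_candidates is hypothetical, not part of difflib):

```python
import difflib


def score_candidates(lead_name, client_names, cutoff=0.6):
    """Return (ratio, name) pairs scoring at least `cutoff`, best first.

    Hypothetical helper: it mirrors how difflib.get_close_matches scores
    candidates, but keeps the ratio so a threshold can be tuned.
    """
    scored = []
    for name in client_names:
        # Same argument order get_close_matches uses internally:
        # the candidate is the first sequence, the search word the second.
        ratio = difflib.SequenceMatcher(None, name, lead_name).ratio()
        if ratio >= cutoff:
            scored.append((round(ratio, 3), name))
    return sorted(scored, reverse=True)


clients = ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes',
           'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes']
for ratio, name in score_candidates('Joe Lopes', clients):
    print(ratio, name)
```

Seeing the ratios makes it easier to pick a cutoff that suits the lead data instead of relying on the default.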

Edit after gruszczy's response:

Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.

In testing, it still doesn't do a better matching job than difflib.

That led me to research which algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.

Unhappily, I can't seem to find another kind soul who has already created a T-SQL implementation based on difflib's... I'll try my hand at it when I can.

If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.

213

asked Jul 29 '10 19:07

cethegeek

2 Answers

Soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't phonetically similar, so soundex won't mark them as similar.
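
To illustrate, here is a rough sketch of the classic American Soundex encoding in Python. It is a simplified version written for this example; a given database's soundex function may differ in details, so treat the codes as indicative only.

```python
# Consonant groups and their Soundex digits; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Simplified American Soundex: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if code and code != prev:
            result += code
        # h and w do not separate equal codes; vowels do (they reset prev).
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


print(soundex("Joe"))     # J000
print(soundex("Joseph"))  # J210
```

The two codes share only the leading letter and one padding zero, which is why a soundex-based comparison rates the pair as a weak match.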

You can try Levenshtein distance, which is implemented in PostgreSQL (in the fuzzystrmatch module). It may be available in your database too; if not, you should be able to write a stored procedure that calculates the distance between two strings and use it in your query.
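
As a reference for what such a stored procedure would compute, here is a minimal Python sketch of Levenshtein distance using the standard dynamic-programming approach. The function name is just illustrative; a database version would follow the same recurrence.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes or substitutions
    needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Joe Lopes", "Joseph Lopes"))  # 3
```

Because the raw distance grows with string length, a threshold usually has to be chosen relative to the length of the names being compared.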

156

answered Sep 29 '22 19:09

gruszczy

This is possible with the trigram_similar lookup, available since Django 1.10. See the docs for PostgreSQL-specific lookups and full text search.
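
In Django the lookup would be used as something like Client.objects.filter(first_name__trigram_similar='Joe'), assuming the Client model from the question has a first_name field, that django.contrib.postgres is in INSTALLED_APPS, and that the pg_trgm extension is installed in the database. To give a feel for what trigram similarity measures, here is a pure-Python approximation; it is a sketch of the idea, not PostgreSQL's actual implementation, and its scores may not match the database exactly.

```python
import re


def trigrams(text):
    """Set of 3-character substrings per word, with each word padded by
    two leading spaces and one trailing space (approximating pg_trgm)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


for candidate in ["Joseph Lopes", "Joe Satriani"]:
    print(candidate, round(trigram_similarity("Joe Lopes", candidate), 3))
```

With this approximation "Joseph Lopes" scores noticeably higher than "Joe Satriani" against "Joe Lopes", which is the ranking the question was after.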

answered Sep 29 '22 18:09

ckarrie

Tags: django, django-queryset, similarity