Approximate string matching of author names - modules and strategies

Tags:

I've created a small program that checks if authors are present in a database of authors. I haven't been able to find any specific modules for this problem, so I'm writing it from scratch using modules for approximate string matching.

The database contains around 6000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc). The query author list is usually between 500-1000 (and I have many of these lists), making speed quite important.

My general strategy is to trim and filter the database as much as possible and look for exact matches. If no matches are found, I move on to approximate string matching.

I'm currently using the built-in difflib.get_close_matches which does exactly what I want- however, it is extremely slow (several minutes). Therefore, I am looking for other options:

What is the fastest module that can return the best, say, 3 matches above some threshold in a database giving a query string?
What's the fastest module for comparing two strings?

The only one I have found is fuzzy wuzzy, which is even slower than difflib.

602

asked Dec 20 '12 21:12

Misconstruction

1 Answers

Try fuzzywuzzy with the native-C python-levenshtein lib installed.

I run a benchmark on my PC for finding the best candidates of 8 words within ~19k words-list with and without C-native levenshtein backend installed (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl) and i got these timings:

No C-backend:
Compared 151664 words in 48.591717004776 sec (0.00032039058052521366 sec/search).
C-backend installed:
Compared 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec/search).

That is ~x4 faster (but not as much as i expected).

Here are the results:

0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`

And here is the python-code of the benchmark:

import os
import zipfile
from urllib import request as urlrequest
from fuzzywuzzy import process as fzproc
import time
import random

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)

def fuzzy_match(dictionary, search):
    nsearch = len(search)
    for i, s in enumerate(search):
        best = fzproc.extractBests(s, dictionary)
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))

def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """ Shuffle and split words-list into `dictionary` and `search-words`. """
    rnd = random.Random(0)
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)

def run_benchmark():
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))

if __name__ == '__main__':
    run_benchmark()

143

answered Oct 12 '22 23:10

ankostis

Related questions
                            
                                Django-Compressor cannot find files when it's actually there
                            
                                Issue with nosetests and file uploads
                            
                                managing python import dependencies in SublimeText2 plugin. specifically unicodedata lib
                            
                                MySql connector dies in Python
                            
                                Trying to get started tracing Python features and behavior through the source code
                            
                                Python Requests - Is it possible to receive a partial response after an HTTP POST?
                            
                                How django handles simultaneous requests with concurrency over global variables?
                            
                                Debugging mixed Python/C++ code in Eclipse
                            
                                celery trying shutdown worker by raising SystemExit in task_postrun signal but always hangs and the main process never exits
                            
                                LLVM IR to Python Compiler [closed]
                            
                                IPython Notebook - Detect if Kernel is Busy/Idle in Javascript
                            
                                PyQt4: Window shows up at another position after hide() and show()
                            
                                minimization in scipy, algorithm to find all local minima of a N dimensional scalar function
                            
                                Django / Python: Convert Markdown to Secure HTML
                            
                                GTK 3 - Popup menu to expand after adding items to it
                            
                                Numpy import fails when embedding python in c
                            
                                How to show and parse rsync progress in Python?
                            
                                Webapp2 sessions are lost between requests
                            
                                Python, Windows service Import error
                            
                                Python Urllib2 Cookiejar with Selenium

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Approximate string matching of author names - modules and strategies

Tags:

python

python-2.7

difflib

Misconstruction

People also ask

1 Answers

ankostis

Recent Activity

Donate For Us