Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Approximate string matching of author names - modules and strategies

I've created a small program that checks if authors are present in a database of authors. I haven't been able to find any specific modules for this problem, so I'm writing it from scratch using modules for approximate string matching.

The database contains around 6000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc). The query author list is usually between 500-1000 (and I have many of these lists), making speed quite important.

My general strategy is to trim and filter the database as much as possible and look for exact matches. If no matches are found, I move on to approximate string matching.

I'm currently using the built-in difflib.get_close_matches which does exactly what I want- however, it is extremely slow (several minutes). Therefore, I am looking for other options:

  • What is the fastest module that can return the best, say, 3 matches above some threshold in a database giving a query string?
  • What's the fastest module for comparing two strings?

The only one I have found is fuzzy wuzzy, which is even slower than difflib.

like image 602
Misconstruction Avatar asked Dec 20 '12 21:12

Misconstruction


People also ask

How do you match a string in Python?

Exact match (equality comparison): == , != As with numbers, the == operator determines if two strings are equal. If they are equal, True is returned; if they are not, False is returned. It is case-sensitive, and the same applies to comparisons by other operators and methods.

What is string matching algorithm in DAA?

String Matching Algorithm is also called "String Searching Algorithm." This is a vital class of string algorithm is declared as "this is the method to find a place where one is several strings are found within the larger string." Given a text array, T [1.....n], of n character and a pattern array, P [1......

How does FuzzyWuzzy work?

Fuzzywuzzy is a python library that uses Levenshtein Distance to calculate the differences between sequences and patterns that was developed and also open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcase them on one platform.


1 Answers

Try fuzzywuzzy with the native-C python-levenshtein lib installed.

I run a benchmark on my PC for finding the best candidates of 8 words within ~19k words-list with and without C-native levenshtein backend installed (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl) and i got these timings:

  • No C-backend:
    Compared 151664 words in 48.591717004776 sec (0.00032039058052521366 sec/search).
  • C-backend installed:
    Compared 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec/search).

That is ~x4 faster (but not as much as i expected).

Here are the results:

0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`

And here is the python-code of the benchmark:

import os
import zipfile
from urllib import request as urlrequest
from fuzzywuzzy import process as fzproc
import time
import random

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)

def fuzzy_match(dictionary, search):
    nsearch = len(search)
    for i, s in enumerate(search):
        best = fzproc.extractBests(s, dictionary)
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))

def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """ Shuffle and split words-list into `dictionary` and `search-words`. """
    rnd = random.Random(0)
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)

def run_benchmark():
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))

if __name__ == '__main__':
    run_benchmark()
like image 143
ankostis Avatar answered Oct 12 '22 23:10

ankostis