I've created a small program that checks if authors are present in a database of authors. I haven't been able to find any specific modules for this problem, so I'm writing it from scratch using modules for approximate string matching.
The database contains around 6000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc.). The query author lists are usually between 500 and 1000 names each (and I have many of these lists), making speed quite important.
My general strategy is to trim and filter the database as much as possible and look for exact matches. If no matches are found, I move on to approximate string matching.
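To make that two-stage strategy concrete, here is a minimal sketch; the `normalize` helper and its title-stripping rules are hypothetical illustrations, not taken from my actual program:

```python
import re
from difflib import get_close_matches

# Hypothetical cleanup: lower-case, strip titles like "Dr.", drop stray punctuation.
TITLE_RE = re.compile(r"\b(dr|prof|mr|mrs|ms)\b\.?\s*", re.IGNORECASE)

def normalize(name):
    name = TITLE_RE.sub('', name.lower())
    return re.sub(r"[^a-z\s'-]", '', name).strip()

def match_authors(queries, database, cutoff=0.8):
    """Stage 1: exact lookup on normalized names; stage 2: fuzzy fallback."""
    index = {normalize(a): a for a in database}
    results = {}
    for query in queries:
        key = normalize(query)
        if key in index:                          # exact match
            results[query] = index[key]
        else:                                     # approximate match
            close = get_close_matches(key, index, n=1, cutoff=cutoff)
            results[query] = index[close[0]] if close else None
    return results

print(match_authors(["Dr. Lemaire", "Peil"], ["Lemaire", "Pfeil"]))
# {'Dr. Lemaire': 'Lemaire', 'Peil': 'Pfeil'}
```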
I'm currently using the built-in difflib.get_close_matches for the fallback, which does exactly what I want; however, it is extremely slow (several minutes). Therefore, I am looking for other options:
The only one I have found is fuzzywuzzy, which is even slower than difflib.
Exact match (equality comparison): `==`, `!=`. As with numbers, the `==` operator determines whether two strings are equal, returning True if they are and False otherwise. The comparison is case-sensitive, and the same applies to the other comparison operators and methods.
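For example (an illustrative snippet; the sample names are made up):

```python
print("Dr. Smith" == "Dr. Smith")   # True: identical strings
print("Dr. Smith" == "dr. smith")   # False: == is case-sensitive
print("Dr. Smith" != "Smith")       # True: a title breaks exact equality
print("SMITH".casefold() == "smith".casefold())  # True: case-insensitive check
```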
A string matching algorithm (also called a "string searching algorithm") is a vital class of string algorithm: it finds the places where one or several patterns occur within a larger string. Given a text array T[1..n] of n characters and a pattern array P[1..m] of m characters, the task is to find all positions in T where P occurs.
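A minimal sketch of the naive form of such a search (my own illustration, not from the original post):

```python
def naive_search(text, pattern):
    """Return every index where `pattern` occurs in `text` (O(n*m) scan)."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(naive_search("lemaire lemay", "lema"))  # [0, 8]
```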
Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences. It was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
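For context, basic usage looks like this (a short illustration with made-up names; `fuzz.ratio` and `process.extract` are part of fuzzywuzzy's public API):

```python
from fuzzywuzzy import fuzz, process

authors = ["Lemaire", "Porter", "Singleton"]
print(fuzz.ratio("Lemarie", "Lemaire"))              # Levenshtein-based score, 0-100
print(process.extract("Lemarie", authors, limit=2))  # (match, score) pairs, best first
```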
Try fuzzywuzzy with the native-C python-levenshtein lib installed.
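A quick way to check whether the C backend is actually available (my addition; `Levenshtein` is the module name that the python-Levenshtein package installs):

```python
try:
    import Levenshtein  # installed by the python-Levenshtein package
    print("C-native backend available:", Levenshtein.distance("Porte", "Porter"))
except ImportError:
    print("Falling back to fuzzywuzzy's pure-Python SequenceMatcher (slower).")
```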
I ran a benchmark on my PC that searches for the best candidates for 8 words within a ~19k-word list, with and without the C-native Levenshtein backend installed (using `pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl`). With the backend it is ~4x faster (but not as much as I expected).
Here are the results:
0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`
And here is the Python code of the benchmark:
```python
import os
import random
import time
import zipfile
from urllib import request as urlrequest

from fuzzywuzzy import process as fzproc

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)


def fuzzy_match(dictionary, search):
    nsearch = len(search)
    for i, s in enumerate(search):
        best = fzproc.extractBests(s, dictionary)
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))


def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """Shuffle and split words-list into `dictionary` and `search-words`."""
    rnd = random.Random(0)  # Fixed seed for a reproducible split.
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)


def run_benchmark():
    # Download the words-list once and cache it locally.
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))


if __name__ == '__main__':
    run_benchmark()
```
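One caveat visible in the results above: the default `WRatio` scorer gives very short dictionary entries such as `'L.'` or `'A'` inflated scores via partial matching. If that is a problem, `extractBests` accepts a `scorer` and a `score_cutoff` (both are part of fuzzywuzzy's API), for example:

```python
from fuzzywuzzy import fuzz, process as fzproc

# A stricter scorer plus a cutoff drops the noise from tiny entries.
best = fzproc.extractBests("Porte", ["Porter", "Po", "L."],
                           scorer=fuzz.ratio, score_cutoff=80)
print(best)  # [('Porter', 91)]
```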