I am currently processing a very large database of locations and trying to match them with their real world coordinates.
To achieve this, I have downloaded the geoname dataset, which contains a lot of entries giving possible names and lat/long coordinates. To try and speed up the process, I have managed to reduce the huge csv file (1.6 GB) to 0.45 GB by removing entries that do not make sense for my dataset. However, it still contains 4 million entries.
Now I have many entries such as:
Knowing that string matching with such long strings is unreliable, I used Stanford's NER via NLTK (a sketch of this step appears after my matching code below) to get a better string to qualify my location. Now I have strings like:
The geoname dataset contains things like:
And I am applying this algorithm to get a good possible match between my entries and the geoname csv containing 4M entries. I first read the geoname_cleaned.csv file and put all of the data into a list. For each of my entries, I then call string_similarity() between the current entry and every entry of the geoname list:
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union
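For reference, the NER extraction step looks roughly like this minimal sketch (the model and jar paths are placeholders for wherever Stanford NER lives on disk):

from nltk.tag import StanfordNERTagger

# Placeholder paths -- point these at your local Stanford NER files.
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar', encoding='utf-8')

def extract_location(text):
    """Keep only the tokens that the tagger labels LOCATION."""
    tagged = st.tag(text.split())
    return ' '.join(word for word, tag in tagged if tag == 'LOCATION')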
I have tested the algorithm on a subset of my original dataset and it works fine, but it is obviously terribly slow (up to 40 seconds for a single location). Since I have more than a million entries to process, this will take a good 10,000 hours or more. I was wondering if you had any ideas on how to speed this up. I thought of parallel processing, obviously (a rough sketch of what I mean is below), but I don't have any HPC solution available. Perhaps simple ideas could help me speed this up.
I'm open to any and every idea you might have, but would prefer a Python-compatible solution.
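To sketch the parallel-processing idea: the standard library's multiprocessing module can at least spread the work over local cores without any HPC setup. Here my_entries and geoname_list are stand-ins for my actual data, and string_similarity is the function above:

from multiprocessing import Pool

def best_geoname(entry):
    # Score one of my entries against every geoname and keep the best.
    return max(geoname_list, key=lambda g: string_similarity(entry, g))

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker per CPU core
        matches = pool.map(best_geoname, my_entries)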
Thanks in advance :).
Edit:
I have tried fuzzywuzzy with fuzz.token_set_ratio(s1, s2) and it performs worse: matches are not as good as with my custom technique, and the running time increased by a good 15 seconds for a single entry.
Edit 2:
I also thought of using some kind of sorting at the beginning to help with the matching, but my naive implementation would not work. I'm sure there are ways to speed this up, perhaps by getting rid of some entries in the geoname dataset or sorting them in some way. I already did a lot of cleaning to remove useless entries, but can't get the number much below 4M.
We can speed up the matching in a couple of ways. I assume that in your code str1 is a name from your dataset and str2 is a geoname string. To test the code I made two tiny data sets from the data in your question. And I wrote two matching functions, best_match and first_match, that use your current string_similarity function so we can see that my strategy gives the same results. best_match checks all geoname strings and returns the string with the highest score if it exceeds a given threshold score, otherwise it returns None. first_match is (potentially) faster: it just returns the first geoname string that exceeds the threshold, or None if it can't find one, in which case it still has to search the entire geoname list.
In my improved version, we generate the bigrams for each str1 once, rather than re-generating the bigrams for str1 for each str2 that we compare it with. And we compute all the geoname bigrams in advance, storing them in a dict indexed by the string so that we don't have to regenerate them for each str1. Also, we store the geoname bigrams as sets. That makes computing the hit_count much faster, since set membership testing is much faster than doing a linear scan over a list of strings. The geodict also needs to store the length of each bigram list: a set contains no duplicate items, so the length of the set of bigrams may be smaller than the length of the list of bigrams, but we need the list length to compute the score correctly.
# Some fake data
geonames = [
    'Slettmarkmountains Jotunheimen Norway',
    'Fairy Glen Skye Scotland UK',
    'Emigrant Wilderness California',
    'Yosemite National Park',
    'Half Dome Yosemite National Park',
]

mynames = [
    'Jotunheimen Norway',
    'Fairy Glen',
    'Slettmarkmountains Jotunheimen Norway',
    'Bryce Canyon',
    'Half Dome',
]

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

# Find the string in geonames which is the best match to str1
def best_match(str1, thresh=0.2):
    score, str2 = max((string_similarity(str1, str2), str2) for str2 in geonames)
    if score < thresh:
        str2 = None
    return score, str2

# Find the 1st string in geonames that matches str1 with a score >= thresh
def first_match(str1, thresh=0.2):
    for str2 in geonames:
        score = string_similarity(str1, str2)
        if score >= thresh:
            return score, str2
    return None
print('Best')
for mystr in mynames:
    print(mystr, ':', best_match(mystr))
print()

print('First')
for mystr in mynames:
    print(mystr, ':', first_match(mystr))
print()
# Put all the geoname bigrams into a dict
geodict = {}
for s in geonames:
    bigrams = get_bigrams(s)
    geodict[s] = (set(bigrams), len(bigrams))

def new_best_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    score, str2 = max((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
                      for str2, (pairs2, pairs2_len) in geodict.items())
    if score < thresh:
        str2 = None
    return score, str2

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    for str2, (pairs2, pairs2_len) in geodict.items():
        score = 2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len)
        if score >= thresh:
            return score, str2
    return None

print('New Best')
for mystr in mynames:
    print(mystr, ':', new_best_match(mystr))
print()

print('New First')
for mystr in mynames:
    print(mystr, ':', new_first_match(mystr))
print()
Output:
Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')
First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')
New Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')
New First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')
new_first_match is fairly straightforward. The line

for str2, (pairs2, pairs2_len) in geodict.items():

loops over every item in geodict, extracting each string, bigram set, and true bigram length. sum(x in pairs2 for x in pairs1) counts how many of the bigrams in pairs1 are members of the pairs2 set.
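For example, with a made-up bigram list and set:

pairs1 = ['ha', 'al', 'lf']              # list of bigrams for one string
pairs2 = {'al', 'lf', 'f ', ' d'}        # a precomputed geoname bigram set
print(sum(x in pairs2 for x in pairs1))  # 2, since 'al' and 'lf' are hits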
So for each geoname string, we compute the similarity score and return it if it's >= the threshold, which has a default value of 0.2. You can give it a different default thresh, or pass a thresh when you call it.
new_best_match is a little more complicated. ;)

((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
 for str2, (pairs2, pairs2_len) in geodict.items())

is a generator expression. It loops over the geodict items and creates a (score, str2) tuple for each geoname string. We then feed that generator expression to the max function, which returns the tuple with the highest score.
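The same pattern in miniature, with made-up scores: tuples compare element by element, so max effectively compares the scores first and carries the string along.

scored = [(0.3, 'a'), (0.7, 'b'), (0.5, 'c')]
print(max(scored))  # (0.7, 'b')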
Here's a version of new_first_match that implements the suggestion that juvian made in the comments. It may save a little bit of time. This version also skips comparisons when either bigram list is empty.
def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None

    hiscore = 0
    for str2, (pairs2, pairs2_len) in geodict.items():
        if not pairs2_len:
            continue
        total_len = pairs1_len + pairs2_len
        # The score can never exceed this upper bound, so skip the
        # expensive bigram count when the bound can't beat hiscore.
        bound = 2.0 * pairs1_len / total_len
        if bound >= hiscore:
            score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
            if score >= thresh:
                return score, str2
            hiscore = max(hiscore, score)
    return None
A simpler variation is to not bother computing hiscore and just compare bound to thresh, as in the sketch below.
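A minimal sketch of that variation (the name new_first_match_simple is mine; it reuses get_bigrams and geodict from above):

def new_first_match_simple(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None
    for str2, (pairs2, pairs2_len) in geodict.items():
        if not pairs2_len:
            continue
        total_len = pairs1_len + pairs2_len
        # hit_count can't exceed pairs1_len, so if even a perfect hit
        # count can't reach thresh, skip the expensive sum entirely.
        if 2.0 * pairs1_len / total_len >= thresh:
            score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
            if score >= thresh:
                return score, str2
    return None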