I am trying to do string match and bring the match id using fuzzy wuzzy in python. My dataset is huge, dataset1 = 1.8 million records, dataset2 = 1.6 million records.
What I tried so far,
First I tried to use record linkage
package in python, unfortunately it ran out of memory when it build the multi index
, so I moved to AWS with good machine power and successfully built it, however when I tried to run the comparison on it, it runs forever, I agree that its due to the number of comparison.
Then, I tried to do string match with fuzzy wuzzy
and parallelise the process using dask
package. And executed it on a sample data. It works fine, but I know the process will still take time as the search space is wide. I am looking for a way to add blocking or indexing on this piece of code.
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})
Here, I am trying to look for test.Address1
in test2.Address1
and bring its ID
.
def fuzzy_score(str1, str2):
return fuzz.token_set_ratio(str1, str2)
def helper(orig_string, slave_df):
slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x,orig_string))
#return my_value corresponding to the highest score
return slave_df.ix[slave_df.score.idxmax(),'ID']
dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(get=dask.multiprocessing.get)
This works fine, however I am not sure how I can apply indexing on it by limiting the search space on the same city.
Lets say, I am creating an index on the city field and subset based on the city of the original string and pass that city to the helper function,
# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)
I don't know how to do that ? Please advise. Thanks in advance.
I prefer using fuzzywuzzy.process.extractOne
. That compares a string to an iterable of strings.
def extract_one(col, other):
# need this for dask later
other = other.compute() if hasattr(other, 'compute') else other
return pd.DataFrame([process.extractOne(x, other) for x in col],
columns=['Address1', 'score', 'idx'],
index=col.index)
extract_one(test.Address1, test2.Address1)
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The idx
is the index of the other
passed to extract_one
that matches closest. I would recommend having a meaningful index, to making joining the results later on easier.
For your second question, about filtering to cities, I would use a groupby and apply
gr1 = test.groupby('city')
gr2 = test2.groupby("city")
gr1.apply(lambda x: extract_one(x.Address1,
gr2.get_group(x.name).Address1))
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The only difference with dask is the need to specify a meta
to the apply:
ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)
dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')
meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
dgr2.get_group(x.name).Address1),
meta=meta).compute()
Address1 score idx
city
U 0 234 kookie Pl 83 1
1 234 kookie Pl 28 1
X 0 123 chese wy 92 0
1 123 chese wy 28 0
Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad
I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzy wuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extentions for computing the Levenshtein distance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With