I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraisers dataset (donor name, email and their first donation). Problems with my data are:
donors and fundraisers:
Sample code I have right now:
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
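# NOTE: this parses donors["Date"], so fundraisers["Date"] ends up holding the donors' dates
# (aligned on the index), as the outputs shown here reflect; parse fundraisers["Date"] to keep the fundraisers' own dates
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)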
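# Key for "same person": exact name plus email
donors["code"] = donors["name"].astype(str) + " " + donors["Email"].astype(str)
# Keep only the rows holding each donor's earliest donation
idx = donors.groupby("code")["Date"].transform("min") == donors["Date"]
# reset_index(drop=True) replaces reset_index().drop('index', 1), whose positional axis argument no longer works in pandas 2.x
donors = donors[idx].reset_index(drop=True)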
So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
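The same "first donation per donor" step can also be sketched with sort_values and drop_duplicates, which avoids building the helper code column. This is only a sketch on made-up data: the example.com addresses and the name first_donations are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Hypothetical sample data (placeholder emails, ISO-formatted dates)
donors = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Tom Smith"],
    "Email": ["john@example.com", "john@example.com", "tom@example.com"],
    "Date": pd.to_datetime([
        "2013-03-27 10:00:00",
        "2013-03-01 10:39:00",
        "2013-03-02 10:39:00",
    ]),
})

# Sort by date, then keep the first row for each exact (name, Email) pair
first_donations = (
    donors.sort_values("Date")
    .drop_duplicates(subset=["name", "Email"], keep="first")
    .reset_index(drop=True)
)
print(first_donations)
```

Like the groupby version, this treats only exact name and email matches as the same person.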
Ideally I want my fundraisers dataset to look like:
Date                 Email              name        Donor Name  Donor Email        Donor Date
2013-03-27 10:00:00  [email protected]  John Doe    John Doe    [email protected]  2013-03-27 10:00:00
2013-01-03 10:39:00  [email protected]  John Doe    John Doe    [email protected]  2013-03-27 10:00:00
2013-02-03 10:39:00  [email protected]  Kathy test  Kat test    [email protected]  2013-03-27 10:39:00
2013-03-03 10:39:00  [email protected]  Tes Ester
2013-04-03 10:39:00  [email protected]  Jane Doe    Jane Doe    [email protected]  2013-04-03 10:39:00
I tried following this thread: "is it possible to do fuzzy match merge with python pandas?", but I keep getting index out of range errors (I'm guessing it doesn't like the duplicated names in fundraisers) :( So, any ideas how I can match/merge these datasets?
For now I'm doing it with for loops, which works but is super slow, and I feel there has to be a better way.
Code:
# Columns that will hold the matched donor's details
fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    best = 75  # score to beat; renamed from max to avoid shadowing the builtin
    for funindex in range(len(fundraisers.index)):
        aname = donors.loc[donindex, "name"]
        comp = fundraisers.loc[funindex, "name"]
        ratio = fuzz.ratio(aname, comp)
        if ratio > best:
            # Double the score when the emails also match
            if donors.loc[donindex, "Email"] == fundraisers.loc[funindex, "Email"]:
                ratio *= 2
            best = ratio
            # .loc avoids chained assignment, which may write to a temporary copy
            fundraisers.loc[funindex, "donor name"] = aname
            fundraisers.loc[funindex, "donor email"] = donors.loc[donindex, "Email"]
            fundraisers.loc[funindex, "donor date"] = donors.loc[donindex, "Date"]
Often you may want to join together two datasets in pandas based on imperfectly matching strings. This is called fuzzy matching. The easiest way to perform fuzzy matching in pandas is to use the get_close_matches() function from the difflib package.
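As a rough illustration of that idea, the sketch below maps each fundraiser name to its closest donor name with difflib.get_close_matches and then does an ordinary left merge on the mapped name. The sample frames, the closest_name helper, the first_donation column and the 0.75 cutoff are all assumptions made up for this example.

```python
import difflib
import pandas as pd

# Hypothetical sample frames
donors = pd.DataFrame({
    "name": ["John Doe", "Kat test", "Tom Smith"],
    "first_donation": pd.to_datetime(["2013-03-01", "2013-03-27", "2013-03-02"]),
})
fundraisers = pd.DataFrame({"name": ["John Doe", "Kathy test", "Tes Ester"]})

def closest_name(name, candidates, cutoff=0.75):
    """Return the most similar candidate, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

candidates = donors["name"].tolist()
fundraisers["donor_name"] = fundraisers["name"].map(
    lambda n: closest_name(n, candidates)
)

# A left merge keeps every fundraiser row, matched or not
merged = fundraisers.merge(
    donors.rename(columns={"name": "donor_name"}),
    on="donor_name",
    how="left",
)
print(merged)
```

Because the merge is driven from the fundraisers side, duplicated fundraiser names each keep their own row. The cutoff is a judgment call and would need tuning on real names.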
To avoid duplicated columns when joining the two data frames, use the pd.merge() function with an inner join and pass the names of the columns to join on from the left and right data frames.
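A minimal sketch of such a merge, assuming both frames share an Email key column and an overlapping name column (the data and the suffix names are invented for illustration):

```python
import pandas as pd

# Hypothetical sample frames with placeholder emails
fundraisers = pd.DataFrame({
    "name": ["Kathy test", "Tes Ester"],
    "Email": ["kathy@example.com", "tes@example.com"],
})
donors = pd.DataFrame({
    "name": ["Kat test", "Tom Smith"],
    "Email": ["kathy@example.com", "tom@example.com"],
    "Date": pd.to_datetime(["2013-03-27 10:39:00", "2013-03-02 10:39:00"]),
})

# Joining on the shared key yields a single Email column; the overlapping
# "name" columns are told apart by the suffixes instead of the default _x/_y
merged = pd.merge(
    fundraisers,
    donors,
    how="inner",
    on="Email",
    suffixes=("_fundraiser", "_donor"),
)
print(merged)
```

An inner join drops fundraisers without an exact email match, so how="left" may be the better fit when every fundraiser row has to survive.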
Index objects are not required to be unique; you can have duplicate row or column labels.
Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:
def get_donors(row):
    # Score every donor against this fundraiser; donors with a different email get a score of 1
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        # .loc replaces .ix, which was removed in pandas 1.0
        v = donors.loc[d.idxmax(), ['name','Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)
Output:
   Date                 Email              name        donor name  donor email        donor date
0  2013-03-27 10:00:00  [email protected]  John Doe    John Doe    [email protected]  2013-03-01 10:39:00
1  2013-03-01 10:39:00  [email protected]  John Doe    John Doe    [email protected]  2013-03-01 10:39:00
2  2013-03-02 10:39:00  [email protected]  Kathy test  Kat test    [email protected]  2013-03-27 10:39:00
3  2013-03-03 10:39:00  [email protected]  Tes Ester
4  2013-03-04 10:39:00  [email protected]  Jane Doe    Jane Doe    [email protected]  2013-03-04 10:39:00
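If fuzzywuzzy is not available, a similar per-row lookup can be sketched with the standard library's difflib.SequenceMatcher swapped in for fuzz.ratio. This is a variation, not the code above: here the score is doubled when the emails match (as in the question's loop) rather than being reset to 1 when they differ, and the sample data, the similarity and best_donor helpers, and the threshold of 75 are assumptions for illustration.

```python
from difflib import SequenceMatcher
import pandas as pd

# Hypothetical sample frames with placeholder emails
donors = pd.DataFrame({
    "name": ["John Doe", "Kat test", "Tom Smith"],
    "Email": ["john@example.com", "kat@example.com", "tom@example.com"],
    "Date": pd.to_datetime([
        "2013-03-01 10:39:00", "2013-03-27 10:39:00", "2013-03-02 10:39:00",
    ]),
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "Kathy test", "Tes Ester"],
    "Email": ["john@example.com", "kat@example.com", "tes@example.com"],
})

COLUMNS = ["donor name", "donor email", "donor date"]

def similarity(a, b):
    """Return a 0-100 similarity score based on SequenceMatcher.ratio()."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def best_donor(row, threshold=75):
    # Name similarity of this fundraiser against every donor
    scores = donors["name"].map(lambda n: similarity(n, row["name"]))
    # Keep the score where emails differ, double it where they match
    scores = scores.where(donors["Email"] != row["Email"], scores * 2)
    if scores.max() < threshold:
        return pd.Series([None, None, pd.NaT], index=COLUMNS, dtype=object)
    match = donors.loc[scores.idxmax()]
    return pd.Series(
        [match["name"], match["Email"], match["Date"]],
        index=COLUMNS,
        dtype=object,
    )

result = pd.concat([fundraisers, fundraisers.apply(best_donor, axis=1)], axis=1)
print(result)
```

This still compares every fundraiser with every donor, so it is tidier than the nested loops but not asymptotically faster.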