Pandas fuzzy merge/match name column, with duplicates

Tags:

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraiser dataset (donor name, email and their first donation). Problems with my data are:

I need to match by name and email, but a user might have slightly different names (ex 'Kat' and 'Kathy').
Duplicate names for donors and fundraisers:
- 2a) With donors I can get unique name/email combinations since I just care about the first donation date
- 2b) With fundraisers though I need to keep both rows and not lose data like the date.

Sample code I have right now:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})

donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)

donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).

Ideally I want my fundraisers dataset to look like:

Date                Email       name        Donor Name  Donor Email Donor Date
2013-03-27 10:00:00     [email protected]      John Doe    John Doe    [email protected]      2013-03-27 10:00:00 
2013-01-03 10:39:00     [email protected]      John Doe    John Doe    [email protected]      2013-03-27 10:00:00 
2013-02-03 10:39:00     [email protected]      Kathy test  Kat test    [email protected]      2013-03-27 10:39:00 
2013-03-03 10:39:00     [email protected]    Tes Ester   
2013-04-03 10:39:00     [email protected]  Jane Doe    Jane Doe    [email protected]  2013-04-03 10:39:00

I tried following this thread: is it possible to do fuzzy match merge with python pandas? but keep getting index out of range errors (guessing it doesn't like the duplicated names in fundraisers) :( So any ideas how I can match/merge these datasets?
doing it with for loops (which works but is super slow and I feel there has to be a better way)

Code:

fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    max = 75
    for funindex in range(len(fundraisers.index)):
        aname = donors["name"][donindex]
        comp = fundraisers["name"][funindex]
        ratio = fuzz.ratio(aname, comp)
        if ratio > max:
            if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
                ratio *= 2
            max = ratio
            fundraisers["donor name"][funindex] = aname
            fundraisers["donor email"][funindex] = donors["Email"][donindex]
            fundraisers["donor date"][funindex] = donors["Date"][donindex]

519

asked Nov 13 '13 21:11

Wizuriel

1 Answers

Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:

def get_donors(row):
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        v = donors.ix[d.idxmax(), ['name','Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)

Output:

                 Date           Email        name donor name     donor email           donor date
0 2013-03-27 10:00:00          [email protected]    John Doe   John Doe          [email protected]  2013-03-01 10:39:00
1 2013-03-01 10:39:00          [email protected]    John Doe   John Doe          [email protected]  2013-03-01 10:39:00
2 2013-03-02 10:39:00          [email protected]  Kathy test   Kat test          [email protected]  2013-03-27 10:39:00
3 2013-03-03 10:39:00    [email protected]   Tes Ester                                                
4 2013-03-04 10:39:00  [email protected]    Jane Doe   Jane Doe  [email protected]  2013-03-04 10:39:00

answered Oct 12 '22 19:10

Roman Pekar

Related questions
                            
                                Creating a model/view interface with sliders using PyQt
                            
                                Wildcard namespaces in lxml
                            
                                Python: .append(0)
                            
                                RGB to YCbCr Conversion problems
                            
                                Submitting a form using Requests Python
                            
                                How do I go about customizing Mezzanine-Cartridge shop/product?
                            
                                Reindexing and filling on one level of a hierarchical index in pandas
                            
                                How to tell scapy sniff() to stop if no packet is received?
                            
                                Non-Blocking error when adding timeout to python server
                            
                                Exposing global data and functions in Pyramid and Jinja 2 templating
                            
                                How does one create a struct from a numba struct type?
                            
                                np.array - too many values to unpack
                            
                                Ping MySQL to keep connection alive in Django
                            
                                hsv_to_rgb isn't the inverse of rgb_to_hsv on matplotlib
                            
                                Pycharm debugger much slower than normal run
                            
                                Does pybtex support accent/special characters in .bib file?
                            
                                python - ensure script is activated only once
                            
                                Parsing python with PLY, how to code the indent and dedent part
                            
                                Installing openCV with anaconda on ubuntu
                            
                                How to remove values on x,y axis on plot in matplotlib

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas fuzzy merge/match name column, with duplicates

Tags:

python

pandas

dataframe

fuzzy-comparison

fuzzywuzzy

Wizuriel

People also ask

1 Answers

Roman Pekar

Recent Activity

Donate For Us