Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: join on partial string match, like Excel VLOOKUP

I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow but they are all slightly different from this use case. Hopefully anyone can guide me in the right direction. I have the following two pandas dataframes:

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

print(df1)
    Invoice Currency
0   20561   EUR
1   20562   EUR
2   20563   EUR
3   20564   USD

print(df2)
    Ref         Type    Amount  Comment
0   20561       01      150     bla
1   INV20562    03      175     bla
2   INV20563BG  04      160     bla
3   20564       02      180     bla

Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes a "partial match" in df2['Ref']. So the joining on 'Invoice' does not give the desired output because it doesn't copy the data for invoices 20562 & 20563, see below:

df3 = df1.join(df2.set_index('Ref'), on='Invoice')

print(df3)
    Invoice Currency    Type    Amount  Comment
0   20561   EUR         01       150    bla
1   20562   EUR         NaN      NaN    NaN
2   20563   EUR         NaN      NaN    NaN
3   20564   USD         02       180    bla

Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop, I get a long way but this isn't very Pythonic.

df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]

print(df4)
Invoice     Currency    Amount
0   20561   EUR         150
1   20562   EUR         175
2   20563   EUR         160
3   20564   USD         180

Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!

like image 690
PythonSherpa Avatar asked Jun 22 '18 08:06

PythonSherpa


People also ask

What is the equivalent of Vlookup in pandas?

In Pandas, VLOOKUP merges two DataFrames if both have a common attribute (column). You can perform VLOOKUP in Pandas using map() and merge() methods as discussed in this article: The map() method. The merge() method.

Is it possible to do fuzzy match merge with Python pandas?

Often you may want to join together two datasets in pandas based on imperfectly matching strings. This is called fuzzy matching. The easiest way to perform fuzzy matching in pandas is to use the get_close_matches() function from the difflib package.

How do I join strings in pandas?

Pandas str.cat() is used to concatenate strings to the passed caller series of string. Distinct values from a different series can be passed but the length of both the series has to be same. . str has to be prefixed to differentiate it from the Python's default method.

Is join or merge faster pandas?

The Fastest Ways As it turns out, join always tends to perform well, and merge will perform almost exactly the same given the syntax is optimal.


2 Answers

Here are two alternative solutions, both using Pandas' merge.

# Solution 1 (checking directly if 'Invoice' string is in the 'Ref' string)
df4 = df2.copy()
df4['Invoice'] = [val for idx, val in enumerate(df1['Invoice']) if val in df2['Ref'][idx]]
df_m4 = df1.merge(df4[['Amount', 'Invoice']], on='Invoice')

# Solution 2 (regex)
import re
df5 = df2.copy()
df5['Invoice'] = [re.findall(r'(\d{5})', s)[0] for s in df2['Ref']]
df_m5 = df1.merge(df5[['Amount', 'Invoice']], on='Invoice')

Both df_m4 and df_m5 will print

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180

Note: The regex solution presented assumes that the invoice numbers are always 5 digits and only takes the first of such occurrences. Solution 1 is more robust, as it directly compares the strings. The regex solution could be improved to be more robust if needed though.

like image 34
Tim Skov Jacobsen Avatar answered Sep 18 '22 20:09

Tim Skov Jacobsen


This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for, I'm not sure it exists in a vectorised form.

df4 = df1.copy()

def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)

print(df4)

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
like image 119
jpp Avatar answered Sep 18 '22 20:09

jpp