I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many related questions on StackOverflow, but they all differ slightly from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas DataFrames:
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
  Invoice Currency
0   20561      EUR
1   20562      EUR
2   20563      EUR
3   20564      USD

print(df2)
          Ref Type Amount Comment
0       20561   01    150     bla
1    INV20562   03    175     bla
2  INV20563BG   04    160     bla
3       20564   02    180     bla
Now I would like to create a new DataFrame (df3) that combines the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes only a "partial match" within df2['Ref']. So joining on 'Invoice' does not give the desired output, because it doesn't copy the data for invoices 20562 and 20563, see below:
df3 = df1.join(df2.set_index('Ref'), on='Invoice')
print(df3)
  Invoice Currency Type Amount Comment
0   20561      EUR   01    150     bla
1   20562      EUR  NaN    NaN     NaN
2   20563      EUR  NaN    NaN     NaN
3   20564      USD   02    180     bla
Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop I get a long way, but it isn't very Pythonic.
df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]
print(df4)
  Invoice Currency Amount
0   20561      EUR    150
1   20562      EUR    175
2   20563      EUR    160
3   20564      USD    180
Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!
Here are two alternative solutions, both using Pandas' merge.
# Solution 1 (checking directly if 'Invoice' string is in the 'Ref' string)
df4 = df2.copy()
df4['Invoice'] = [val for idx, val in enumerate(df1['Invoice']) if val in df2['Ref'][idx]]
df_m4 = df1.merge(df4[['Amount', 'Invoice']], on='Invoice')
# Solution 2 (regex)
import re
df5 = df2.copy()
df5['Invoice'] = [re.findall(r'(\d{5})', s)[0] for s in df2['Ref']]
df_m5 = df1.merge(df5[['Amount', 'Invoice']], on='Invoice')
Both df_m4 and df_m5 will print:
  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
Note: the regex solution assumes the invoice numbers are always 5 digits and only takes the first such occurrence. Solution 1 compares the strings directly, but it relies on df1 and df2 having the same row order: the list comprehension pairs each df1['Invoice'] with the df2['Ref'] at the same position, so it would produce a too-short list (and a failed assignment) if any pair did not match. The regex solution could be made more robust if needed.
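One way to make the regex variant more robust (a sketch of my own, not from the original answer; df6 and df_m6 are just illustrative names) is to build the pattern from the invoice numbers actually present in df1, so the match is anchored to known values rather than to "any 5 digits":

import re

# Alternation pattern built from the known invoice numbers,
# e.g. '(20561|20562|20563|20564)'. re.escape is a no-op for plain
# digits but keeps the sketch safe for other identifier formats.
pattern = '(' + '|'.join(re.escape(inv) for inv in df1['Invoice']) + ')'

df6 = df2.copy()
df6['Invoice'] = df6['Ref'].str.extract(pattern, expand=False)
df_m6 = df1.merge(df6[['Amount', 'Invoice']], on='Invoice')

This does not care where the invoice number sits inside 'Ref', and it does not depend on the row order of the two frames.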
This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for; I'm not sure it exists in a vectorised form.
df4 = df1.copy()

def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)
print(df4)
  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
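One caveat worth noting (my addition, not part of the original answer): .iloc[0] raises an IndexError if no row of df2['Ref'] contains the invoice, and str.contains treats its argument as a regular expression by default. A slightly hardened sketch, where get_amount_safe is a hypothetical name:

import numpy as np

def get_amount_safe(x):
    # regex=False matches the invoice as a literal substring
    matches = df2.loc[df2['Ref'].str.contains(x, regex=False), 'Amount']
    # return NaN instead of raising when nothing matches
    return matches.iloc[0] if not matches.empty else np.nan

df4['Amount'] = df4['Invoice'].apply(get_amount_safe)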