Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace comma-separated values in a dataframe with values from another dataframe

this is my first question on StackOverflow, so please pardon if I am not clear enough. I usually find my answers here but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, Ref_ID in df2 contains IDs that form the string contained in the field References in df1

What I would like to do is to replace values in the References field in df1 so it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009)           | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far, I had to deal with columns and IDs with a 1-1 relationship, and this works perfectly Pandas - Replacing Values by Looking Up in an Another Dataframe

But I cannot get my mind around this slightly different problem. The only solution I could think of is to re-iterate a for and if cycles that compare every string of df1 to df2 and make the substitution.

This would be, I am afraid, very slow as I have ca. 2000 unique Ref_IDs and I have to repeat this operation in several columns similar to the References one.

Anyone is willing to point me in the right direction?

Many thanks in advance.

like image 357
Alessio Rovere Avatar asked May 20 '26 03:05

Alessio Rovere


1 Answers

Let's try this:

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

@Datanovice thanks for the update.

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
like image 70
Scott Boston Avatar answered May 22 '26 17:05

Scott Boston



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!