I have 2 dataframes with a partially matching column:
data_a = ['a', 'b', 'c', 'd', 'e', 'f']
data_b = ['b', 'd', 'f', 'h']
df1 = pd.DataFrame(data_a, columns=['data'])
df2 = pd.DataFrame(data_b, columns=['data'])
I would like to get the indices match between the data in df2 to df1.
Something like:
indices = [1, 3, 5, None]
So the 0 index row of df2 matches the 1 index row of df1, the 1 index row of df2 matches the 3 index row of df1, etc...
If no match, then the value is None or any negative value.
It may be assumed there will be a single match (The data columns are unique in their values).
I could do it in a loop, yet it is slow.
I am looking for efficient way to get this or something similar.
Should work as long as values in df1 are unique and hashable --
>>> s1 = df1['data']
>>> s2 = df2['data']
>>> s3 = pd.Series(data=s1.index, index=s1)
>>> s3
data
a 0
b 1
c 2
d 3
e 4
f 5
dtype: int64
>>> s2.map(s3)
0 1.0
1 3.0
2 5.0
3 NaN
Name: data, dtype: float64
>>> s2.map(s3).fillna(-1).astype(int).to_list()
[1, 3, 5, -1]
reset_index, then left-merge:
df2.merge(df1.reset_index(), on='data', how='left')
Output:
data index
0 b 1.0
1 d 3.0
2 f 5.0
3 h NaN
Note that if you have several matches, this will create additional rows:
data_a = ['a', 'b', 'c', 'd', 'b', 'f']
data_b = ['b', 'd', 'f', 'h']
df1 = pd.DataFrame(data_a, columns=['data'])
df2 = pd.DataFrame(data_b, columns=['data'])
out = df2.merge(df1.reset_index(), on='data', how='left')
data index
0 b 1.0
1 b 4.0
2 d 3.0
3 f 5.0
4 h NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With