Simplified dfs:
df = pd.DataFrame(
{
"ID": [6, 2, 4],
"to ignore": ["foo", "whatever", "idk"],
"value": ["A", "B", "A"],
}
)
df2 = pd.DataFrame(
{
"ID_number": [1, 2, 3, 4, 5, 6],
"A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
"B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
}
)
ID to ignore value
0 6 foo A
1 2 whatever B
2 4 idk A
A B ID_number
0 0.91 0.11 1
1 0.42 0.22 2
2 0.85 0.45 3
3 0.84 0.38 4
4 0.81 0.01 5
5 0.88 0.18 6
I want to add a column to df
which includes combinations of df['ID']
to df2['ID_number']
and df['value']
to the df2
column matching the value in df[value]
(either 'A' or 'B').
We can add a column of matching values where the lookup column name in df2
is given, 'A':
df["NewCol"] = df["ID"].map(
df2.drop_duplicates("ID_number").set_index("ID_number")["A"]
)
Which gives:
ID to ignore value NewCol
0 6 foo A 0.88
1 2 whatever B 0.42
2 4 idk A 0.84
But this doesn't give values for B, so the value '0.42' above when looking for 'B' should instead be '0.22'.
df["NewCol"] = df["ID"].map(
df2.drop_duplicates("ID_number").set_index("ID_number")[df["value"]]
)
obviously doesn't work. How can do I this?
You can set ID_number
as index in df2
,then use pd.Index.get_indexer
here.
df2 = df2.set_index('ID_number')
r = df2.index.get_indexer(df['ID'])
c = df2.columns.get_indexer(df['value'])
df['new_col'] = df2.values[r, c]
df
ID to ignore value new_col
0 6 foo A 0.88
1 2 whatever B 0.22
2 4 idk A 0.84
Benchmarked using the below setup:
Tested on Ubuntu 20.04.1 LTS(focal), Cpython3.8.5, Ipython shell(7.18.1), pandas(1.1.4), numpy(1.19.2)
Setup
df2 = pd.DataFrame(
{
"ID_number": np.arange(1, 1_000_000 + 1),
"A": np.random.rand(1_000_000),
"B": np.random.rand(1_000_000),
}
)
df = pd.DataFrame(
{
"ID": np.random.randint(1, 1_000_000, 50_000),
"to ignore": ["anything"] * 50_000,
"value": np.random.choice(["A", "B"], 50_000),
}
)
Resutls:
@Vaishali
In [57]: %%timeit
...: mapper = df2.set_index('ID_number').to_dict('index')
...: df['NewCol'] = df.apply(lambda x: mapper[x['ID']][x['value']], axis =
...: 1)
...:
...:
2.09 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ch3steR
In [58]: %%timeit
...: t = df2.set_index('ID_number')
...: r = t.index.get_indexer(df['ID'])
...: c = t.columns.get_indexer(df['value'])
...: df['new_col'] = df2.values[r, c]
...:
...:
49.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Mayank
In [59]: %%timeit
...: x = df2.set_index('ID_number').stack()
...: y = df.set_index(['ID', 'value'])
...: y['NewCol'] = y.index.to_series().map(x.to_dict())
...: y.reset_index(inplace=True)
...:
...:
3.41 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Jezrael
In [60]: %%timeit
...: df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
...: .drop_duplicates(['ID_number','value'])
...: .rename(columns={'ID_number':'ID'}))
...: df.merge(df11, on=['ID','value'], how='left')
...:
...:
693 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With