Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Adding df column finding matching values in another df for both indexed values and a dynamic source column?

Simplified dfs:

df = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "to ignore": ["foo", "whatever", "idk"],
        "value": ["A", "B", "A"],
    }
)

df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)

   ID to ignore value
0   6       foo     A
1   2  whatever     B
2   4       idk     A

   A     B  ID_number
0  0.91  0.11          1
1  0.42  0.22          2
2  0.85  0.45          3
3  0.84  0.38          4
4  0.81  0.01          5
5  0.88  0.18          6

I want to add a column to df which includes combinations of df['ID'] to df2['ID_number'] and df['value'] to the df2 column matching the value in df[value] (either 'A' or 'B').

We can add a column of matching values where the lookup column name in df2 is given, 'A':

df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")["A"]
)

Which gives:

   ID to ignore value  NewCol
0   6       foo     A    0.88
1   2  whatever     B    0.42
2   4       idk     A    0.84

But this doesn't give values for B, so the value '0.42' above when looking for 'B' should instead be '0.22'.

df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")[df["value"]]
)

obviously doesn't work. How can do I this?

like image 484
Chris Dixon Avatar asked Dec 01 '20 05:12

Chris Dixon


1 Answers

You can set ID_number as index in df2,then use pd.Index.get_indexer here.

df2 = df2.set_index('ID_number')
r = df2.index.get_indexer(df['ID'])
c = df2.columns.get_indexer(df['value'])
df['new_col'] = df2.values[r, c]
df

   ID to ignore value  new_col
0   6       foo     A     0.88
1   2  whatever     B     0.22
2   4       idk     A     0.84

Timeits

Benchmarked using the below setup:

Tested on Ubuntu 20.04.1 LTS(focal), Cpython3.8.5, Ipython shell(7.18.1), pandas(1.1.4), numpy(1.19.2)

Setup

df2 = pd.DataFrame(
    {
        "ID_number": np.arange(1, 1_000_000 + 1),
        "A": np.random.rand(1_000_000),
        "B": np.random.rand(1_000_000),
    }
)

df = pd.DataFrame(
    {
        "ID": np.random.randint(1, 1_000_000, 50_000),
        "to ignore": ["anything"] * 50_000,
        "value": np.random.choice(["A", "B"], 50_000),
    }
)

Resutls:

@Vaishali
In [57]: %%timeit
    ...: mapper = df2.set_index('ID_number').to_dict('index')
    ...: df['NewCol'] = df.apply(lambda x: mapper[x['ID']][x['value']], axis =
    ...: 1)
    ...: 
    ...: 
2.09 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ch3steR
In [58]: %%timeit
    ...: t = df2.set_index('ID_number')
    ...: r = t.index.get_indexer(df['ID'])
    ...: c = t.columns.get_indexer(df['value'])
    ...: df['new_col'] = df2.values[r, c]
    ...: 
    ...: 
49.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Mayank
In [59]: %%timeit
    ...: x = df2.set_index('ID_number').stack()
    ...: y = df.set_index(['ID', 'value'])
    ...: y['NewCol'] = y.index.to_series().map(x.to_dict())
    ...: y.reset_index(inplace=True)
    ...: 
    ...: 
3.41 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Jezrael
In [60]: %%timeit
    ...: df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
    ...:            .drop_duplicates(['ID_number','value'])
    ...:            .rename(columns={'ID_number':'ID'}))
    ...: df.merge(df11, on=['ID','value'], how='left')
    ...: 
    ...: 
693 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 178
Ch3steR Avatar answered Sep 18 '22 15:09

Ch3steR