I have two dataframes, and I want to take one of the columns from the first and create a new column in the second, based on the values in multiple (other) columns.
First dataframe (df1):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': np.random.rand(10),
                    'unused1': np.random.rand(10)})
cond point unused1 value1
0 1 1 0.923699 0.103046
1 1 2 0.046528 0.188408
2 1 3 0.677052 0.481349
3 1 4 0.464000 0.807454
4 1 5 0.180575 0.962032
5 2 1 0.941624 0.437961
6 2 2 0.489738 0.026166
7 2 3 0.739453 0.109630
8 2 4 0.338997 0.415101
9 2 5 0.310235 0.660748
and the second (df2):
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': np.random.rand(20)})
cond point value2
0 1 1 0.990252
1 1 2 0.534813
2 1 3 0.407325
3 1 4 0.969288
4 1 5 0.085832
5 1 1 0.922026
6 1 2 0.567615
7 1 3 0.174402
8 1 4 0.469556
9 1 5 0.511182
10 2 1 0.219902
11 2 2 0.761498
12 2 3 0.406981
13 2 4 0.551322
14 2 5 0.727761
15 2 1 0.075048
16 2 2 0.159903
17 2 3 0.726013
18 2 4 0.848213
19 2 5 0.284404
df1['value1'] contains values for each combination of cond and point. I want to create a new column (new_column) in df2 that contains the values from df1['value1'] where cond and point match across the two dataframes.
So my desired output looks like this:
cond point value2 new_column
0 1 1 0.990252 0.103046
1 1 2 0.534813 0.188408
2 1 3 0.407325 0.481349
3 1 4 0.969288 0.807454
4 1 5 0.085832 0.962032
5 1 1 0.922026 0.103046
6 1 2 0.567615 0.188408
7 1 3 0.174402 0.481349
8 1 4 0.469556 0.807454
9 1 5 0.511182 0.962032
10 2 1 0.219902 0.437961
11 2 2 0.761498 0.026166
12 2 3 0.406981 0.109630
13 2 4 0.551322 0.415101
14 2 5 0.727761 0.660748
15 2 1 0.075048 0.437961
16 2 2 0.159903 0.026166
17 2 3 0.726013 0.109630
18 2 4 0.848213 0.415101
19 2 5 0.284404 0.660748
In this example I could just use tile/repeat, but in reality df1['value1'] doesn't fit so neatly into the other dataframe, so I just need to do it based on matching the cond and point columns.
I've tried merging them, but 1) the numbers don't seem to match and 2) I don't want to bring over any unused columns from df1:
df1.merge(df2, left_on=['cond', 'point'], right_on=['cond', 'point'])
What's the correct way to add this new column without having to iterate through the two dataframes?
Option 1
For grace and speed with pure pandas, we can use lookup. This will produce the same output as all the other options, shown below. The concept is to represent the lookup data as a 2-D array and look up values with the indices.
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
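Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in later releases; on a recent pandas, one sketch of the same idea uses a MultiIndex reindex instead (not part of the original answer or timings):
s1 = df1.set_index(['cond', 'point'])['value1']                # values keyed by (cond, point)
keys = pd.MultiIndex.from_arrays([df2['cond'], df2['point']])  # df2's key pairs
df2.assign(new_column=s1.reindex(keys).to_numpy())             # align values to df2's rows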
Option 2
We can do the same thing with numpy to improve performance, if the values are laid out the same way they are in df1. This is very fast!
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
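If cond and point are not guaranteed to be 1-based and contiguous, the same array-indexing idea can be sketched with np.searchsorted; this is an assumption-laden sketch that still requires exactly one df1 row per (cond, point) pair:
cond_cats = np.sort(df1['cond'].unique())                  # sorted category values
point_cats = np.sort(df1['point'].unique())
a = df1.sort_values(['cond', 'point'])['value1'].to_numpy().reshape(
        len(cond_cats), len(point_cats))                   # 2-D lookup table
ci = np.searchsorted(cond_cats, df2['cond'].to_numpy())    # row index per df2 row
pi = np.searchsorted(point_cats, df2['point'].to_numpy())  # column index per df2 row
df2.assign(new_column=a[ci, pi])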
Option 3
The canonical answer is to use merge with how='left', but we'll need to prep df1 a bit to nail the output:
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
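The same call can be spelled out with explicit keywords; this is purely a stylistic variant of the line above:
df2.merge(d1, how='left', on=['cond', 'point'])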
Option 4
I thought this was fun: build a mapping dictionary and a series to map on. Good for small data, not so good for large data; see the timings below.
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}
c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})
df2.assign(new_column=s2.map(m))
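A more compact sketch of the same mapping idea, reusing the dict m built above and skipping the intermediate Series:
keys = list(zip(df2.cond.values.tolist(), df2.point.values.tolist()))  # (cond, point) per df2 row
df2.assign(new_column=[m[k] for k in keys])                            # plain dict lookups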
OUTPUT
cond point value2 new_column
0 1 1 0.990252 0.103046
1 1 2 0.534813 0.188408
2 1 3 0.407325 0.481349
3 1 4 0.969288 0.807454
4 1 5 0.085832 0.962032
5 1 1 0.922026 0.103046
6 1 2 0.567615 0.188408
7 1 3 0.174402 0.481349
8 1 4 0.469556 0.807454
9 1 5 0.511182 0.962032
10 2 1 0.219902 0.437961
11 2 2 0.761498 0.026166
12 2 3 0.406981 0.109630
13 2 4 0.551322 0.415101
14 2 5 0.727761 0.660748
15 2 1 0.075048 0.437961
16 2 2 0.159903 0.026166
17 2 3 0.726013 0.109630
18 2 4 0.848213 0.415101
19 2 5 0.284404 0.660748
Timing
small data
%%timeit
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 304 µs per loop
%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 1.8 ms per loop
%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}
c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})
df2.assign(new_column=s2.map(m))
1000 loops, best of 3: 719 µs per loop
%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 2.04 ms per loop
%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.01 ms per loop
%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 2.15 ms per loop
large data
df2 = pd.concat([df2] * 10000, ignore_index=True)
%%timeit
a = df1.value1.values.reshape(2, -1)
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1])
1000 loops, best of 3: 1.93 ms per loop
%%timeit
d1 = df1.set_index(['cond', 'point']).value1.unstack()
df2.assign(new_column=d1.lookup(df2.cond, df2.point))
100 loops, best of 3: 5.58 ms per loop
%%timeit
c1 = df1.cond.values.tolist()
p1 = df1.point.values.tolist()
v1 = df1.value1.values.tolist()
m = {(c, p): v for c, p, v in zip(c1, p1, v1)}
c2 = df2.cond.values.tolist()
p2 = df2.point.values.tolist()
i2 = df2.index.values.tolist()
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)})
df2.assign(new_column=s2.map(m))
10 loops, best of 3: 135 ms per loop
%%timeit
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
df2.merge(d1, 'left')
100 loops, best of 3: 13.4 ms per loop
%%timeit
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df.rename(columns={'value1': 'new_column'})
10 loops, best of 3: 19.8 ms per loop
%%timeit
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df.rename(columns={'value1': 'new_column'})
100 loops, best of 3: 18.2 ms per loop
You can use merge with a left join and drop to remove the unused1 column, then rename the column.
Note: the on parameter can be omitted if the join columns are the only columns the two DataFrames share; if more column names overlap, add on=['cond', 'point'].
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left')
df = df.rename(columns={'value1': 'new_column'})
print(df)
cond point value2 new_column
0 1 1 0.990252 0.103046
1 1 2 0.534813 0.188408
2 1 3 0.407325 0.481349
3 1 4 0.969288 0.807454
4 1 5 0.085832 0.962032
5 1 1 0.922026 0.103046
6 1 2 0.567615 0.188408
7 1 3 0.174402 0.481349
8 1 4 0.469556 0.807454
9 1 5 0.511182 0.962032
10 2 1 0.219902 0.437961
11 2 2 0.761498 0.026166
12 2 3 0.406981 0.109630
13 2 4 0.551322 0.415101
14 2 5 0.727761 0.660748
15 2 1 0.075048 0.437961
16 2 2 0.159903 0.026166
17 2 3 0.726013 0.109630
18 2 4 0.848213 0.415101
19 2 5 0.284404 0.660748
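If the two DataFrames shared additional column names beyond the keys, the join columns would have to be listed explicitly, for example:
df = pd.merge(df2, df1.drop('unused1', axis=1), how='left', on=['cond', 'point'])
df = df.rename(columns={'value1': 'new_column'})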
Another solution uses join (a left join by default) together with set_index + drop:
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point'])
df = df.rename(columns={'value1': 'new_column'})
print(df)
cond point value2 new_column
0 1 1 0.990252 0.103046
1 1 2 0.534813 0.188408
2 1 3 0.407325 0.481349
3 1 4 0.969288 0.807454
4 1 5 0.085832 0.962032
5 1 1 0.922026 0.103046
6 1 2 0.567615 0.188408
7 1 3 0.174402 0.481349
8 1 4 0.469556 0.807454
9 1 5 0.511182 0.962032
10 2 1 0.219902 0.437961
11 2 2 0.761498 0.026166
12 2 3 0.406981 0.109630
13 2 4 0.551322 0.415101
14 2 5 0.727761 0.660748
15 2 1 0.075048 0.437961
16 2 2 0.159903 0.026166
17 2 3 0.726013 0.109630
18 2 4 0.848213 0.415101
19 2 5 0.284404 0.660748
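The same join can also be written so the column is selected and renamed up front, which avoids the separate drop and rename steps (a sketch, equivalent for the data above):
df = df2.join(df1.set_index(['cond', 'point'])['value1'].rename('new_column'),
              on=['cond', 'point'])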