I have 2 columns and I want a 3rd column to be the minimum value between them. My data looks like this:
   A  B
0  2  1
1  2  1
2  2  4
3  2  4
4  3  5
5  3  5
6  3  6
7  3  6
And I want to get a column C in the following way:
   A  B  C
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
Some helping code:
import pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
Thanks!
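Min value between two pandas columns
Calling min() twice, as in df.min().min(), returns the single smallest value in the whole DataFrame; for a row-wise minimum of two columns, one call with axis=1 is enough.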
Pandas DataFrame min() Method
The min() method returns a Series with the minimum value of each column. By specifying the column axis (axis='columns'), the min() method searches across the columns and returns the minimum value for each row.
To get the minimum value of a single column, select that column from the DataFrame and call min() on it, for example df['A'].min().
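The three variants described above can be compared side by side. This is a minimal sketch using the question's sample data; the variable names col_min, a_min and row_min are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# Default axis=0: one minimum per column, returned as a Series
# indexed by column name.
col_min = df.min()

# Minimum of a single column: a scalar.
a_min = df['A'].min()

# axis='columns' (equivalent to axis=1): one minimum per row.
row_min = df.min(axis='columns')

print(col_min.tolist())  # [2, 1]  (minimum of A, minimum of B)
print(a_min)             # 2
print(row_min.tolist())  # [1, 1, 2, 2, 3, 3, 3, 3]
```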
Use df.min(axis=1)
In[41]:
df['c'] = df.min(axis=1)
df
Out[41]:
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
This returns the row-wise minimum (when passing axis=1).
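Note that df.min(axis=1) takes the minimum over every column in the frame. If the DataFrame holds more than the two columns of interest, it may be safer to select them first. The sketch below assumes a hypothetical extra column named 'other' that is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6],
                   # Hypothetical extra column, added only for illustration.
                   'other': [0, 0, 0, 0, 0, 0, 0, 0]})

# Select only the columns to compare, so 'other' does not affect the result.
df['C'] = df[['A', 'B']].min(axis=1)

print(df['C'].tolist())  # [1, 1, 2, 2, 3, 3, 3, 3]
```

Without the column selection, every value of C here would be 0, because 'other' is the smallest in each row.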
For homogeneous dtypes and large dfs you can use numpy.min, which will be quicker:
In[42]:
df['c'] = np.min(df.values,axis=1)
df
Out[42]:
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
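As an alternative that is not part of the answer above: when exactly two columns are compared, numpy.minimum takes the element-wise minimum of two inputs directly. A minimal sketch, using the question's column name C:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# np.minimum compares its two inputs element by element, so it handles
# exactly two columns; np.min reduces a whole array along an axis.
df['C'] = np.minimum(df['A'], df['B'])

print(df['C'].tolist())  # [1, 1, 2, 2, 3, 3, 3, 3]
```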
Timings:
In[45]:
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df]*1000, ignore_index=True)
df.shape
Out[45]: (8000, 2)
So for an 8K-row df:
%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
314 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
34.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can see that the numpy version is nearly 10x quicker (note that I pass df.values, so numpy receives a plain array). This will become more of a factor with even larger dfs.
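Outside IPython, where %timeit is not available, a comparison like the one above could be reproduced with the standard timeit module. This is only a sketch: the repeat count n is an arbitrary choice, and the measured numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df] * 1000, ignore_index=True)  # 8000 rows, 2 columns

pandas_result = df.min(axis=1)
numpy_result = np.min(df.values, axis=1)
# Both approaches must agree before their speed is compared.
assert (pandas_result.to_numpy() == numpy_result).all()

n = 200  # arbitrary number of repetitions per measurement
t_pandas = timeit.timeit(lambda: df.min(axis=1), number=n) / n
t_numpy = timeit.timeit(lambda: np.min(df.values, axis=1), number=n) / n

print(f"pandas: {t_pandas * 1e6:.1f} µs per call")
print(f"numpy:  {t_numpy * 1e6:.1f} µs per call")
```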
Note: for pandas versions 0.24.0 or greater, use to_numpy(), so the above becomes:
df['c'] = np.min(df.to_numpy(),axis=1)
Timings:
%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
%timeit np.min(df.to_numpy(),axis=1)
314 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
35.2 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
There is a minor discrepancy between .values and to_numpy(). It depends on whether you know upfront that the dtype is not mixed, and the likely dtype is a factor, e.g. float16 vs float32; see that link for further explanation. Pandas does a little more checking when calling to_numpy().
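To illustrate why mixed dtypes matter, the sketch below inspects the dtype of the array that to_numpy() produces for three small, made-up frames. The frame and variable names are hypothetical and exist only for this example.

```python
import pandas as pd

same = pd.DataFrame({'A': [2, 3], 'B': [1, 5]})           # both columns int64
mixed_num = pd.DataFrame({'A': [2, 3], 'B': [1.5, 5.0]})  # int64 and float64
mixed_obj = pd.DataFrame({'A': [2, 3], 'B': ['x', 'y']})  # integers and strings

# The resulting array needs a single dtype that can hold every column.
same_dtype = same.to_numpy().dtype
num_dtype = mixed_num.to_numpy().dtype
obj_dtype = mixed_obj.to_numpy().dtype

print(same_dtype)  # int64
print(num_dtype)   # float64 (the integers are upcast)
print(obj_dtype)   # object (no common numeric dtype)

# to_numpy() also accepts an explicit dtype when the target is known upfront.
as_f32 = mixed_num.to_numpy(dtype='float32')
print(as_f32.dtype)  # float32
```

An object array loses most of numpy's speed advantage, and np.min over rows that mix numbers and strings fails, which is why the numpy route suits frames with one numeric dtype.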