Pandas: get the min value between 2 dataframe columns

I have 2 columns and I want a 3rd column to be the minimum value between them. My data looks like this:

   A  B
0  2  1
1  2  1
2  2  4
3  2  4
4  3  5
5  3  5
6  3  6
7  3  6

And I want to get a column C in the following way:

   A  B   C
0  2  1   1
1  2  1   1
2  2  4   2
3  2  4   2
4  3  5   3
5  3  5   3
6  3  6   3
7  3  6   3

Some code to reproduce the data:

import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

Thanks!

Adrian asked Apr 12 '19 14:04




1 Answer

Use df.min(axis=1)

df['c'] = df.min(axis=1)
df
Out[41]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3

Passing axis=1 returns the minimum of each row (across the columns) rather than the minimum of each column.
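Note that df.min(axis=1) looks at every column in the frame, so if you run it a second time (after 'c' already exists) or the frame has other columns, those are included too. A minimal sketch, assuming you only ever want to compare A and B, that selects the two columns explicitly (the column name 'C' is taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# Select only the columns to compare, then take the row-wise minimum.
df['C'] = df[['A', 'B']].min(axis=1)
print(df)
```

Selecting the columns first keeps the result the same no matter what else is in the frame.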

For homogeneous dtypes (every column has the same dtype) and large dfs, you can use numpy.min, which will be quicker:

In[42]:
df['c'] = np.min(df.values,axis=1)
df

Out[42]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3

Timings:

In[45]:
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df]*1000, ignore_index=True)
df.shape

Out[45]: (8000, 2)

So for an 8K-row df:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
314 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
34.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can see that the numpy version is roughly 9x quicker (note that I pass df.values, so numpy receives a numpy array rather than a DataFrame). This will become more of a factor with even larger dfs.

Note

For pandas versions 0.24.0 or greater, use to_numpy() instead of .values,

so the above becomes:

df['c'] = np.min(df.to_numpy(),axis=1)
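For exactly two columns there is also an element-wise variant. This is not the approach above but a different numpy function, np.minimum, which compares two arrays position by position. A minimal sketch, assuming both columns are numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# np.minimum takes two arrays and returns the smaller value at each position.
df['c'] = np.minimum(df['A'].to_numpy(), df['B'].to_numpy())
print(df)
```

I have not timed this one, so treat it as an alternative spelling rather than a faster option.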

Timings:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
%timeit np.min(df.to_numpy(),axis=1)
314 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
35.2 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

There is only a minor difference between .values and to_numpy() in these timings. Which one to use depends on whether you know up front that the dtypes are not mixed, and the resulting dtype is also a factor (e.g. float16 vs float32); see the pandas documentation on to_numpy() for further explanation. Pandas does a little more checking when calling to_numpy().
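To see why the dtype matters, here is a small sketch (the three frames are made up for illustration) showing what array dtype you get back when the columns do or do not share a dtype:

```python
import pandas as pd

# All-integer columns: the array keeps an integer dtype.
same = pd.DataFrame({'A': [2, 3], 'B': [1, 5]})

# Integer and float columns: the array is upcast to a float dtype.
mixed = pd.DataFrame({'A': [2, 3], 'B': [1.5, 5.0]})

# Numeric and string columns: the array falls back to object dtype.
text = pd.DataFrame({'A': [2, 3], 'B': ['x', 'y']})

for name, frame in [('same', same), ('mixed', mixed), ('text', text)]:
    print(name, frame.to_numpy().dtype)
```

With an object array numpy can no longer use its fast numeric routines, which is why the numpy approach is only recommended when the dtypes are homogeneous.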

EdChum answered Oct 13 '22 21:10