Pandas: get the min value between 2 dataframe columns

Tags: python, python-3.x, pandas, dataframe, min

I have 2 columns and I want a 3rd column to be the minimum value between them. My data looks like this:

   A  B
0  2  1
1  2  1
2  2  4
3  2  4
4  3  5
5  3  5
6  3  6
7  3  6

And I want to get a column C in the following way:

   A  B   C
0  2  1   1
1  2  1   1
2  2  4   2
3  2  4   2
4  3  5   3
5  3  5   3
6  3  6   3
7  3  6   3

Some helping code:

import pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

Thanks!

341

asked Apr 12 '19 14:04

Adrian

1 Answer

Use df.min(axis=1)

df['c'] = df.min(axis=1)
df
Out[41]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3

Passing axis=1 returns the minimum of each row rather than of each column.
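If the frame holds more columns than the two being compared, the same call can be limited to a column subset. A minimal sketch, assuming the question's columns A and B plus a hypothetical extra column named label that is not in the original data:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
# Hypothetical extra column, added only to show why a subset may be needed
df['label'] = list('abcdefgh')

# Select just the columns to compare, then take the row-wise minimum
df['C'] = df[['A', 'B']].min(axis=1)
print(df)
```

Selecting the columns explicitly keeps the non-numeric column out of the comparison.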

If all columns share the same dtype and the df is large, you can use numpy.min, which will be quicker:

In[42]:
df['c'] = np.min(df.values,axis=1)
df

Out[42]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3

Timings:

In[45]:
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df]*1000, ignore_index=True)
df.shape

Out[45]: (8000, 2)

So for an 8K-row df:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
314 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
34.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can see that the numpy version is roughly 9x quicker (314 µs vs 34.4 µs). Note that I pass df.values so that numpy receives a plain array rather than a DataFrame. The difference will matter more for even larger dfs.
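When exactly two columns are involved, numpy.minimum is another option: it compares two arrays element by element. This is a different function from the numpy.min call used above, and it was not part of the timings shown here, so treat it as an untimed sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# Element-wise minimum of two columns; the result keeps the df's index
df['c'] = np.minimum(df['A'], df['B'])
print(df)
```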

Note: for pandas versions 0.24.0 or greater, use to_numpy(), so the above becomes:

df['c'] = np.min(df.to_numpy(),axis=1)

Timings:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
%timeit np.min(df.to_numpy(),axis=1)
314 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
35.2 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
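The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The loop counts below are arbitrary choices, and the numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df] * 1000, ignore_index=True)  # 8000 rows, as above

candidates = {
    'df.min(axis=1)': lambda: df.min(axis=1),
    'np.min(df.to_numpy(), axis=1)': lambda: np.min(df.to_numpy(), axis=1),
}

timings = {}
for name, fn in candidates.items():
    # Best of 5 repeats of 100 calls each, converted to seconds per call
    timings[name] = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    print(f'{name}: {timings[name] * 1e6:.1f} µs per call')
```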

There is a minor difference between .values and to_numpy(). Which one is appropriate depends on whether you know upfront that the dtype is not mixed, and the likely dtype is a factor: for example, float16 and float32 columns are combined into a single float32 array (see the pandas documentation for to_numpy for further explanation). Pandas does a little more checking when calling to_numpy.
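The dtype point can be checked directly. A small sketch, using made-up frames that are not from the question, showing the common dtype that to_numpy() produces when column dtypes differ:

```python
import numpy as np
import pandas as pd

# Illustrative frame with two different float widths
mixed_floats = pd.DataFrame({
    'x': np.array([1.5, 2.5], dtype=np.float16),
    'y': np.array([1.0, 3.0], dtype=np.float32),
})
arr = mixed_floats.to_numpy()
print(arr.dtype)             # common dtype of float16 and float32
row_min = arr.min(axis=1)    # row-wise minimum on the converted array

# Integer and float columns are upcast to a common float dtype
int_and_float = pd.DataFrame({'i': [1, 2], 'f': [0.5, 3.0]})
print(int_and_float.to_numpy().dtype)
```

Because the array is upcast to one dtype before the minimum is taken, the result column may have a wider dtype than some of the inputs.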

139

answered Oct 13 '22 21:10

EdChum
