How to get column name for second largest row value in pandas DataFrame

Tags: python, sorting, pandas, dataframe, numpy

I have a pretty simple question, I think, but I can't wrap my head around it. I am a beginner with Python and pandas. I searched the forum but couldn't find a (recent) answer that fits my needs.

I have a data frame such as this one:

import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index=['a1', 'a2', 'a3'])

Which gives:

          A   B    C   D
    a1  1.1   2  3.3   4
    a2  2.7  10  5.4   7
    a3  5.3   9  1.5  15

My question is simple: I would like to add a column that gives the name of the column holding the second-largest value in each row.

I have written a simple function which returns the second-largest value for each row:

def get_second_best(x):
    return sorted(x)[-2]

df['value'] = df.apply(get_second_best, axis=1)

Which gives:

      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0

But I can't find how to display the column name in the 'value' column, instead of the value... I'm thinking about boolean indexing (comparing the 'value' column values with each row), but I haven't figured out how to do it.

To be clearer, I would like it to be:

      A   B    C   D  value
a1  1.1   2  3.3   4    C
a2  2.7  10  5.4   7    D
a3  5.3   9  1.5  15    B

Any help (and explanation) appreciated!


asked Sep 23 '18 09:09

prcbnt

2 Answers

One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smaller of those using Series.idxmin:

In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())

In [46]: df
Out[46]:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
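
For what it's worth, the same result should also be obtainable without transposing, by applying the lambda along axis=1. This is only a sketch of an equivalent spelling on the question's sample frame; it is not one of the variants timed in this answer:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

# Each row arrives as a Series indexed by column name, so
# nlargest(2) keeps the top two entries and idxmin() returns
# the label of the smaller one, i.e. the second-largest overall.
result = df.apply(lambda row: row.nlargest(2).idxmin(), axis=1)
print(result.tolist())
```

Assigning result to df['value'] then gives the frame shown above.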

It is worth noting that calling Series.idxmin inside the lambda, rather than DataFrame.idxmin on the result of apply, can make a difference performance-wise:

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs

Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:

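import numpy as np
from numba import njit, prange  # prange is only needed for the parallel variant

@njit
def arg_second_largest(arr):
    # One result per row: the column position of the second-largest value.
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # -np.inf replaces np.NINF, which was removed in NumPy 2.0.
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                # New maximum: the previous maximum becomes the runner-up.
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                # Not a new maximum, but at least as large as the runner-up.
                second = x
                arg_second = i
        args[k] = arg_second
    return args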
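
To make the inner loop easier to follow, here is the same single-pass scan for one row in plain Python, without Numba. The name second_largest_index is made up for this illustration, and the sketch will be far slower than the compiled version; it only shows how the two running maxima are tracked:

```python
import math

def second_largest_index(row):
    """Return the position of the second-largest value in a sequence."""
    first = second = -math.inf
    arg_first = arg_second = 0
    for i, x in enumerate(row):
        if x >= first:
            # New maximum: the old maximum becomes the runner-up.
            second, arg_second = first, arg_first
            first, arg_first = x, i
        elif x >= second:
            # Not a new maximum, but at least as large as the runner-up.
            second, arg_second = x, i
    return arg_second

# The three rows of the question's frame, with its column labels.
rows = [[1.1, 2, 3.3, 4], [2.7, 10, 5.4, 7], [5.3, 9, 1.5, 15]]
columns = ["A", "B", "C", "D"]
names = [columns[second_largest_index(r)] for r in rows]
print(names)
```

Each row is visited once and nothing is sorted, which is why this approach scales well to wide frames once compiled.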

Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:

df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)]        # 74.1 µs ± 775 ns

df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)]        # 1.11 ms ± 22.6 µs

In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
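
As for the argpartition variant in the timings: np.argpartition(values, -2) places at position -2 the element that would be there in a fully sorted row, without sorting the rest, which is presumably why it pulls ahead of argsort on wide rows. A small check on the question's data, as a sketch (it assumes no ties within a row, where the two methods could pick different columns):

```python
import numpy as np

values = np.array([[1.1, 2, 3.3, 4],
                   [2.7, 10, 5.4, 7],
                   [5.3, 9, 1.5, 15]])
columns = np.array(["A", "B", "C", "D"])

# Full sort of every row, then take the second-to-last position.
by_sort = values.argsort(axis=1)[:, -2]

# Partial sort: only position -2 is guaranteed to be in its sorted place,
# with smaller elements before it and larger ones after it.
by_partition = np.argpartition(values, -2, axis=1)[:, -2]

print(columns[by_sort])
print(columns[by_partition])
```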

answered Sep 24 '22 12:09

fuglede

Here's one solution using NumPy. The idea is to argsort the values in your dataframe along each row, select the second-to-last column of the result, and finally use this to index df.columns.

df['value'] = df.columns[df.values.argsort(1)[:, -2]]

print(df)

      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
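
Spelled out step by step on the same frame, the one-liner does roughly the following. This is only a sketch to show the intermediate arrays; the names order and second are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

# For each row, the column positions ordered from smallest to largest value.
order = df.values.argsort(axis=1)

# The second-to-last entry of each row is the position of the second-largest value.
second = order[:, -2]

# Map the integer positions back to column labels.
df['value'] = df.columns[second]
print(df)
```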

You should find this more efficient than Pandas-based solutions:

# Python 3.6, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])

%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())  # 49.6 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin()  # 73.2 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]       # 36.3 µs

answered Sep 25 '22 12:09

jpp
