
How to get column name for second largest row value in pandas DataFrame

I have a pretty simple question - I think - but it seems I can't wrap my head around this one. I am a beginner with Python and Pandas. I searched the forum but couldn't find a (recent) answer that fits my need.

I have a data frame such as this one:

import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index=['a1', 'a2', 'a3'])

Which gives:

          A   B    C   D
    a1  1.1   2  3.3   4
    a2  2.7  10  5.4   7
    a3  5.3   9  1.5  15

My question is simple: I would like to add a column that gives the column name of the second-largest value in each row.

I have written a simple function that returns the second-largest value for each row:

def get_second_best(x):
    return sorted(x)[-2]

df['value'] = df.apply(lambda row: get_second_best(row), axis=1)

Which gives:

      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0

But I can't work out how to display the column name in the 'value' column instead of the value itself... I'm thinking about boolean indexing (comparing each row's values with the 'value' column), but I haven't figured out how to do it.
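(A sketch of how that comparison could look, assuming the 'value' column computed above already holds the second-largest values: compare every cell against that column and take the label of the first match; ties would resolve to the leftmost matching column.)

df['value'] = df.drop(columns='value').eq(df['value'], axis=0).idxmax(axis=1)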

To be clearer, I would like it to be:

      A   B    C   D  value
a1  1.1   2  3.3   4    C
a2  2.7  10  5.4   7    D
a3  5.3   9  1.5  15    B

Any help (and explanation) appreciated!

asked Sep 23 '18 by prcbnt



2 Answers

One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smallest of those using Series.idxmin:

In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())

In [46]: df
Out[46]:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B

It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs

Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:

import numpy as np
from numba import njit, prange  # prange is used in the parallel variant mentioned below

@njit
def arg_second_largest(arr):
    # For each row of a 2D array, return the column index of the row's second-largest value.
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # Track the largest and second-largest values seen so far, and their positions.
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args
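The function works on the underlying NumPy array, so applying it to a frame is the same one-liner used in the benchmarks below:

df['value'] = df.columns[arg_second_largest(df.values)]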

Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:

df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)]        # 74.1 µs ± 775 ns

df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)]        # 1.11 ms ± 22.6 µs

In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
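A sketch of that parallel variant, reusing the imports above; the function name here is just illustrative, and the per-row scan is unchanged:

@njit(parallel=True)
def arg_second_largest_parallel(arr):
    args = np.empty(len(arr), dtype=np.int_)
    # prange tells Numba it may run the iterations of the outer loop in parallel
    for k in prange(len(arr)):
        a = arr[k]
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args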

answered Sep 24 '22 by fuglede


Here's one solution using NumPy. The idea is to argsort the values in your dataframe along each row, select the second-to-last column of the result (the position of the row's second-largest value), and finally use this to index df.columns.

df['value'] = df.columns[df.values.argsort(1)[:, -2]]

print(df)

      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B

You should find this more efficient than Pandas-based solutions:

# Python 3.6, NumPy 1.14.3, Pandas 0.23.0

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])

%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())  # 49.6 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin()  # 73.2 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]       # 36.3 µs
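For wider frames, the np.argpartition variant from the benchmarks in the answer above is also worth considering; it gives the same result with only a partial sort per row:

df['value'] = df.columns[np.argpartition(df.values, -2)[:, -2]]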
answered Sep 25 '22 by jpp