I have what I think is a pretty simple question, but I can't wrap my head around it. I am a beginner with Python and Pandas. I searched the forum but couldn't find a (recent) answer that fits my need.
I have a data frame such as this one:
df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index = ['a1', 'a2', 'a3'])
Which gives:
      A   B    C   D
a1  1.1   2  3.3   4
a2  2.7  10  5.4   7
a3  5.3   9  1.5  15
My question: I would like to add a column that gives, for each row, the name of the column holding the second-largest value.
I have written a simple function which returns the second-largest value of each row:
def get_second_best(x):
    return sorted(x)[-2]

df['value'] = df.apply(lambda row: get_second_best(row), axis=1)
Which gives:
      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0
But I can't figure out how to put the column name in the 'value' column instead of the value itself. I'm thinking about boolean indexing (comparing the 'value' column with each row), but I haven't figured out how to do it.
To be clearer, I would like it to be:
      A   B    C   D  value
a1  1.1   2  3.3   4      C
a2  2.7  10  5.4   7      D
a3  5.3   9  1.5  15      B
Any help (and explanation) appreciated!
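The boolean-indexing idea mentioned in the question can be made to work. Here is a minimal sketch, assuming the values within each row are distinct (on ties, idxmax returns the first matching column). It reuses the question's df and get_second_best; the names second_value and mask are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

def get_second_best(x):
    return sorted(x)[-2]

# Second-largest value of each row, as in the question.
second_value = df.apply(get_second_best, axis=1)

# Compare every column with that value row by row (axis=0 aligns on the index),
# then take the first column where the comparison is True.
mask = df.eq(second_value, axis=0)
df['value'] = mask.idxmax(axis=1)
print(df)
```

This relies on exact equality, which is safe here only because the compared value was taken from the same row.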
In Pandas, the DataFrame.loc[] indexer gets a specific cell value by row label and column label (column name).
You can get the column names of a pandas DataFrame with df.columns.values and pass the result to Python's list() function to get a list, which you can then print with print().
To select a row of a DataFrame by position, use the iloc[] indexer with the row number in square brackets (iloc is an indexer, not a function).
df.iloc[-2] returns the penultimate row for all columns. It is equivalent to df.iloc[df.shape[0] - 2], where df.shape[0] is the row count and subtracting 2 gives the position of the penultimate row.
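As a quick illustration of the accessors described above, here is a small sketch on the question's data frame (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

cell = df.loc['a2', 'C']                # label-based lookup of a single cell
column_names = list(df.columns.values)  # column labels as a plain list
penultimate = df.iloc[-2]               # position-based: second-to-last row
same_row = df.iloc[df.shape[0] - 2]     # same row, computed from the row count

print(cell)
print(column_names)
print(penultimate.name)
```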
One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smallest of those using Series.idxmin:
In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())
In [46]: df
Out[46]:
      A   B    C   D  value
a1  1.1   2  3.3   4      C
a2  2.7  10  5.4   7      D
a3  5.3   9  1.5  15      B
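To see why this works, it may help to look at a single row. A minimal sketch, assuming the original four-column df from the question: nlargest(2) keeps the two largest entries of the row together with their column labels, and idxmin then returns the label of the smaller of the two. Applying the same lambda with axis=1 should give the same result as transposing first.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

row = df.loc['a2']          # A 2.7, B 10.0, C 5.4, D 7.0
top_two = row.nlargest(2)   # keeps B (10.0) and D (7.0)
print(top_two.idxmin())     # label of the smaller of the two: D

# Same computation row-wise, without the transpose.
result = df.apply(lambda r: r.nlargest(2).idxmin(), axis=1)
print(result.tolist())      # ['C', 'D', 'B']
```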
It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs
Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:
import numpy as np
from numba import njit, prange

@njit
def arg_second_largest(arr):
    # One output index per row of the 2D input array.
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # -np.inf replaces np.NINF, which was removed in NumPy 2.0.
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                # New maximum: the old maximum becomes the second largest.
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args
Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:
df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)] # 74.1 µs ± 775 ns
df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)] # 1.11 ms ± 22.6 µs
In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
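For reference, the parallel variant described above might look like the following. This is a sketch rather than the exact code that was benchmarked: the function name arg_second_largest_parallel is made up here, and the try/except fallback to plain Python is an assumption added only so that the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption: without Numba, run the same loops as plain (slow) Python.
    prange = range

    def njit(**kwargs):
        return lambda func: func

@njit(parallel=True)
def arg_second_largest_parallel(arr):
    args = np.empty(len(arr), dtype=np.int_)
    # prange lets Numba distribute the rows across threads.
    for k in prange(len(arr)):
        a = arr[k]
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args

# The question's data as a float array; the result holds column positions.
values = np.array([[1.1, 2.0, 3.3, 4.0],
                   [2.7, 10.0, 5.4, 7.0],
                   [5.3, 9.0, 1.5, 15.0]])
print(arg_second_largest_parallel(values))
```

Each row is processed independently, which is why the outer loop can be parallelised without any locking.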
Here's one solution using NumPy. The idea is to argsort the values in your dataframe, select the second-to-last column of the result, and finally use this to index df.columns.
df['value'] = df.columns[df.values.argsort(1)[:, -2]]
print(df)
      A   B    C   D  value
a1  1.1   2  3.3   4      C
a2  2.7  10  5.4   7      D
a3  5.3   9  1.5  15      B
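Broken into steps, the one-liner does the following (a sketch on the question's data; the intermediate names order and second_pos are only for illustration). Note that argsort sorts in ascending order, so position -2 in each sorted row belongs to the second-largest value.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9],
                   'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
                  index=['a1', 'a2', 'a3'])

# For each row, the column positions that would sort it in ascending order.
order = df.values.argsort(axis=1)
# [[0 1 2 3]
#  [0 2 3 1]
#  [2 0 1 3]]

# Second-to-last entry of each row = position of the second-largest value.
second_pos = order[:, -2]            # [2 3 1]

# Map the positions back to column labels.
df['value'] = df.columns[second_pos]
print(df['value'].tolist())          # ['C', 'D', 'B']
```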
You should find this more efficient than Pandas-based solutions:
# Python 3.6, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 49.6 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 73.2 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 36.3 µs