Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding n lowest values for each row in a dataframe

Tags:

python

pandas

min

I have a large datarframe with 1739 rows and 1455 columns. I want to find the 150 lowest values for each row (Not the the 150 th value but 150 values).

I iterate over rows with a basic for loop.

I tried df.min(axis=1) but it only gives out one min. And also the rolling_min function without success.

Is there any existing function where i can enter the number of values i want to find witn .min?

My ultimate goal is to take the 150 lowest values and create a slope then calculate the area under the curve. Do this for each row and add the areas to obtain a volume.

Example of the the dataframe, I have a df that looks like this:

     -218.7     -218.4    ...          217.2      217.5
0     56.632706  13.638315    ...      76.543000  76.543000
1     56.633455  13.576762    ...      76.543000  76.543000
2    -18.432203 -18.384091    ...      76.543000  76.543000
3    -18.476594 -18.439804    ...      76.543000  76.543000

The header is the '-218.7 ...' which are the coordinates in the x axis of a scan. The data is the height of the scan the y axis. What i need is the 150 lowest values for each rows and there associated column header as i want to make a curve for each row then calculate the area under the curve.

So i need for each line something like this :

         -218.7     -218.4    ... for 150 columns
4    -18.532035 -18.497517    ... for 150 values

I don't think i need to store the header info for each line, a for loop would go trough each row one at a time.

like image 572
Mat17 Avatar asked Feb 25 '19 19:02

Mat17


2 Answers

Use .argsort to get the indices of the underlying array sorted. Slice the values and the column Index to get all of the information you need. We'll create a MultiIndex so we can store both the column headers and values in the same DataFrame. The first level will be your nth lowest indicator.

Example:

import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,100000, (1739, 26)))
df.columns = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

N = 7  # 150 in your case
idx = np.argsort(df.values, 1)[:, 0:N]

pd.concat([pd.DataFrame(np.take_along_axis(df.to_numpy(), idx, axis=1), index=df.index),
           pd.DataFrame(df.columns.to_numpy(), index=df.index)],
           keys=['Value', 'Columns'], axis=1)

Output:

      Value                                           Columns                  
          0      1      2      3      4      5      6       0  1  2  3  4  5  6
0      5193   7752   8445  19947  20610  21441  21759       C  K  U  V  I  G  P
1       432   3607  16278  17138  19434  26104  33879       R  J  W  C  B  D  G
2        16   1047   1845   9553  12314  13784  19432       K  S  E  F  M  O  U
3       244   5272  10836  13682  29237  33230  34448       K  Q  A  S  X  W  G
4      9765  11275  13160  22808  30870  33484  42760       K  T  L  U  C  D  M
5      2034   2179   4980   7184  14826  15238  22807       Z  H  F  Q  L  R  X
...
like image 89
ALollz Avatar answered Sep 21 '22 00:09

ALollz


You can use heapq.nsmallest to find the n smallest numbers in a list. This can be quickly applied to each row of a dataframe using .apply:

import pandas as pd
import numpy as np
import heapq

df = pd.DataFrame(np.random.randn(1000, 1000))

# Find the 150 smallest values in each row
smallest = df.apply(lambda x: heapq.nsmallest(150, x), axis=1)

Each row of smallest is now a list of the 150 smallest values in the corresponding row in df.

This can be converted to a dataframe using:

smallest_df = pd.DataFrame(smallest.values.tolist())

This is now a dataframe where each row corresponds to each row in the original dataframe. There are 150 columns, with the 150 smallest values in each row of the original.

smallest_df.head()

smallest_df

like image 27
willk Avatar answered Sep 21 '22 00:09

willk