
How to select all rows which contain values greater than a threshold?

The request is simple: I want to select all rows which contain a value greater than a threshold.

If I do it like this:

df[(df > threshold)]

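This returns every row, with the values that do not exceed the threshold replaced by NaN. How do I keep only the rows that contain at least one value above the threshold, with their original values intact?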

Stefan Falk asked Mar 05 '17 20:03




2 Answers

There is absolutely no need for the double transposition: you can simply call any along the column axis (axis=1 or axis='columns') on your Boolean matrix. Pass the axis by keyword, since pandas 2.0 and later no longer accept it positionally in any.

df[(df > threshold).any(axis=1)]

Example

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))

>>> df

    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
2  37   2  55  68  16  14  93  14  71  84
3  67  45  79  75  27  94  46  43   7  40
4  61  65  73  60  67  83  32  77  33  96

>>> df[(df > 95).any(axis=1)]

    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
4  61  65  73  60  67  83  32  77  33  96
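Since the frame above is random, here is a minimal sketch on a small hand-written frame that can be run as-is. The data and the names threshold, mask, row_has_hit and selected are made up for illustration and are not from the question. It passes axis=1 by keyword, which recent pandas releases require.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"a": [1, 50, 3], "b": [99, 2, 4], "c": [5, 6, 7]})
threshold = 40

# Boolean frame: True where a cell exceeds the threshold.
mask = df > threshold

# Reduce across columns: True for rows with at least one True cell.
row_has_hit = mask.any(axis=1)

# Indexing with a boolean Series keeps whole rows, so no NaN is introduced.
selected = df[row_has_hit]
print(selected)
```

Indexing with the two-dimensional mask (df[mask]) keeps the frame's shape and blanks out non-matching cells, whereas indexing with the one-dimensional row_has_hit drops the non-matching rows entirely.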

Transposing as your self-answer does is just an unnecessary performance hit.

df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))

# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop

# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
miradulo answered Sep 18 '22 01:09


This is actually very simple:

df[df.T[(df.T > 0.33)].any()]
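As a quick sanity check, this sketch compares the transposed form with a row-wise reduction on a small made-up frame (the data and the names via_transpose and via_any are illustrative assumptions). On this data both select the same rows. One caveat, as far as I can tell: the transposed form tests the truthiness of the surviving values rather than the mask itself, so a value of 0 that passes a negative threshold would not count as a hit.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"x": [0.10, 0.50, 0.20], "y": [0.90, 0.05, 0.30]})
threshold = 0.33

# Transposed form: mask the transposed frame, then reduce over its columns
# (which are the original rows).
via_transpose = df[df.T[df.T > threshold].any()]

# Row-wise reduction on the boolean frame, with no transposition.
via_any = df[(df > threshold).any(axis=1)]

print(via_transpose.equals(via_any))
```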
Stefan Falk answered Sep 19 '22 01:09