I have a dataframe daily that looks like this:
import pandas as pd
daily
time_stamp 22 72 79 86 87 88 90
2013-10-01 0.000000 0.000 8.128000 0.254 0.000000 0.000000 0.000000
2013-10-02 0.000000 0.000 0.000000 0.000 0.000000 0.000000 0.000000
2013-10-04 0.000000 0.000 0.000000 0.000 2.540000 0.762000 0.000000
2013-10-08 2.286000 0.000 0.000000 1.016 1.016000 0.254000 0.000000
2013-10-11 2.794000 0.000 0.000000 0.000 3.810000 1.016000 0.762000
2013-10-12 1.524000 0.000 0.000000 2.286 5.588000 0.254000 26.41600
2013-10-13 0.762000 0.000 8.890000 0.000 2.540000 1.270000 4.572000
2013-10-14 1.524000 0.000 0.000000 0.000 2.540000 4.064000 0.000000
2013-10-15 0.000000 0.000 0.000000 0.000 0.000000 0.000000 0.000000
2013-10-16 0.000000 3.810 1.524000 3.048 0.508000 0.762000 5.080000
2013-10-17 0.000000 0.000 0.254000 0.000 0.000000 0.000000 0.508000
2013-10-18 8.128000 0.762 4.826000 0.508 7.366000 4.572000 1.524000
2013-10-19 8.382000 0.254 0.000000 0.000 6.858000 16.510000 2.032000
2013-10-20 0.000000 0.000 0.000000 0.000 4.064000 5.842000 0.000000
2013-10-21 0.000000 0.508 0.000000 0.000 1.016000 0.000000 0.000000
2013-10-22 2.794000 2.540 1.016000 0.000 0.508000 15.748000 0.000000
I want to run summary statistics, i.e. describe(), on only the values greater than 0. The problem is that if I use dailyrf = daily[(daily > 0.).any(1)], the rows with zeroes are still included when I call dailyrf.describe(). Alternatively, dailyrf = daily[(daily > 0.).all(1)] only returns rows in which every column is greater than 0.
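The reason neither filter works is that any/all build a row-wise boolean and then keep or drop whole rows, while the goal here is element-wise exclusion. A tiny toy frame (hypothetical data; note that recent pandas requires the axis keyword to be spelled out as axis=1 instead of the positional 1) shows the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [3.0, 0.0, 4.0]})

# Row-wise filters keep or drop entire rows:
any_rows = df[(df > 0).any(axis=1)]  # keeps every row with at least one positive value
all_rows = df[(df > 0).all(axis=1)]  # keeps only rows where every column is positive

print(len(any_rows))  # 3 -- every row has some positive value, so zeros survive
print(len(all_rows))  # 1 -- only the row [2.0, 4.0] has no zero at all
```

Either way, zeros are never removed cell by cell, which is why describe() still sees them.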
I also tried daily[daily==0] = 'NaN', which gave me this warning:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy This is separate from the ipykernel package so we can avoid doing imports until"
And this isn't a solution either, because describe() then returns this:
22 72 79 86 87 88 90 93 95 96 97
count 720 684 721 719 718 720 720 721 720 720 719
unique 103 80 73 64 80 108 112 108 86 113 98
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 470 494 560 510 539 483 486 441 570 474 476
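That count/unique/top/freq output appears because assigning the string 'NaN' upcasts the columns to object dtype, and describe() treats object columns as categorical rather than numeric. A minimal sketch (hypothetical values) of the two describe() modes:

```python
import pandas as pd

# Numeric dtype: describe() computes numeric statistics
numeric = pd.Series([1.0, None, 2.0])
print(numeric.describe().index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Object dtype, as you get after assigning the string 'NaN':
# describe() falls back to categorical-style statistics
strings = pd.Series(['1.0', 'NaN', '2.0'])
print(strings.describe().index.tolist())
# ['count', 'unique', 'top', 'freq']
```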
What I really want is mean, standard deviation, etc. for all the values greater than 0 in each column.
This should be pretty simple using mask.
df.mask(df == 0).describe()
22 72 79 86 87 88 90
count 8.000000 5.000000 7.000000 6.000000 12.000000 11.000000 7.00000
mean 3.524250 1.574800 4.680857 1.227667 3.196167 4.641273 5.84200
std 3.000573 1.538745 3.752722 1.174092 2.391229 5.992560 9.24574
min 0.762000 0.254000 0.254000 0.254000 0.508000 0.254000 0.50800
25% 1.524000 0.508000 1.270000 0.317500 1.016000 0.762000 1.14300
50% 2.540000 0.762000 4.826000 0.762000 2.540000 1.270000 2.03200
75% 4.127500 2.540000 8.128000 1.968500 4.445000 5.207000 4.82600
max 8.382000 3.810000 8.890000 3.048000 7.366000 16.510000 26.41600
All values satisfying df == 0 are masked, and describe will not take them into account when calculating the statistics.
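As a side note, mask has a mirror image, where, and neither one mutates the original frame, so the zeros are still there for any later analysis. A small sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [3.0, 0.0, 4.0]})

masked = df.mask(df == 0)   # cells where the condition is True become NaN
same = df.where(df != 0)    # where() keeps cells where the condition is True

print(masked.equals(same))  # True -- the two spellings are equivalent
print(df["a"].iloc[0])      # 0.0 -- the original frame is untouched
print(masked["a"].mean())   # 1.5 -- mean of [1.0, 2.0]; NaN is ignored
```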
To fix your original code, note that the string 'NaN' is not the same thing as np.nan; assign the actual missing-value marker instead:
import numpy as np

df[df == 0] = np.nan
df.describe()
Out[696]:
22 72 79 86 87 88 90
count 8.000000 5.000000 7.000000 6.000000 12.000000 11.000000 7.00000
mean 3.524250 1.574800 4.680857 1.227667 3.196167 4.641273 5.84200
std 3.000573 1.538745 3.752722 1.174092 2.391229 5.992560 9.24574
min 0.762000 0.254000 0.254000 0.254000 0.508000 0.254000 0.50800
25% 1.524000 0.508000 1.270000 0.317500 1.016000 0.762000 1.14300
50% 2.540000 0.762000 4.826000 0.762000 2.540000 1.270000 2.03200
75% 4.127500 2.540000 8.128000 1.968500 4.445000 5.207000 4.82600
max 8.382000 3.810000 8.890000 3.048000 7.366000 16.510000 26.41600
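Note that this assignment overwrites the zeros permanently. If you want the same statistics without modifying df in place, replace is one more option; a sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [3.0, 0.0, 4.0]})

cleaned = df.replace(0, np.nan)  # returns a new frame; df keeps its zeros
stats = cleaned.describe()

print(stats.loc["count", "a"])   # 2.0 -- the zero is no longer counted
print(stats.loc["mean", "a"])    # 1.5 -- mean of [1.0, 2.0]
```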