Pandas: Counting the proportion of zeros in rows and columns of dataframe

I have the code below. It surprises me that it works for the columns but not for the rows.

import numpy as np
from numpy import size
import pandas as pd

def summarizing_data_variables(df):
    numberRows=size(df['ID'])
    numberColumns=size(df.columns)
    summaryVariables=np.empty([numberColumns,2], dtype =  np.dtype('a50'))    
    cont=-1    
    for column in df.columns:
        cont=cont+1
        summaryVariables[cont][0]=column
        summaryVariables[cont][1]=size(df[df[column].isin([0])][column])/(1.0*numberRows)
    print summaryVariables

def summarizing_data_users(df):
    print "Sumarizing users..."   
    numberRows=size(df['ID'])
    numberColumns=size(df.columns)      
    summaryVariables=np.empty([numberRows,2], dtype =  np.dtype('a50'))    
    cont=-1

    for row in df['ID']:
        cont=cont+1
        summaryVariables[cont][0]=row
        dft=df[df['ID']==row]
        proportionZeros=(size(dft[dft.isin([0])])-1)/(1.0*(numberColumns-1)) # The -1 is used to not count the ID column
        summaryVariables[cont][1]=proportionZeros
    print summaryVariables


if __name__ == '__main__':

    df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0],[3,4,5]])
    df.columns=['ID','var1','var2']
    print df

    summarizing_data_variables(df)
    summarizing_data_users(df) 

The output is this:

   ID  var1  var2
0   1     2     3
1   2     5     0
2   3     4     5
[['ID' '0.0']
 ['var1' '0.0']
 ['var2' '0.333333333333']]
Sumarizing users...
[['1' '1.0']
 ['2' '1.0']
 ['3' '1.0']]

I was expecting that for users:

Sumarizing users...
[['1' '0.0']
 ['2' '0.5']
 ['3' '0.0']]

It seems that the problem is in this line:

dft[dft.isin([0])]

It does not constrain dft to the "True" values like in the first case.
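A minimal sketch of the behaviour described above, using the same example frame. It assumes a reasonably recent pandas and NumPy; the names `by_series` and `by_frame` are only for illustration. Indexing with a boolean Series drops the non-matching rows, while indexing with a boolean DataFrame keeps the shape and fills the non-matching cells with NaN, so a size-based count still sees every cell:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]],
                  columns=['ID', 'var1', 'var2'])

# Boolean Series mask (the column case): rows where the mask is False are dropped.
by_series = df[df['var2'].isin([0])]

# Boolean DataFrame mask (the row case): the shape is kept and
# cells where the mask is False become NaN.
dft = df[df['ID'] == 2]
by_frame = dft[dft.isin([0])]

print(by_series.shape)              # one row survives the Series mask
print(by_frame.shape)               # same shape as dft
print(np.size(by_frame))            # counts every cell, including the NaN ones
print(int(by_frame.count().sum()))  # counts only the non-NaN cells
```

With this frame, the size-based count for the row with ID 2 is 3 rather than 1, which would explain the 1.0 reported for every user.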

Can you help me with this? (1) How can I correct the users (rows) part (the second function above)? (2) Is this the most efficient way to do this? My database is very big.

EDIT:

In the function summarizing_data_variables(df) I try to evaluate the proportion of zeros in each column. In the example above, the variable ID has no zeros (so the proportion is zero), the variable var1 has no zeros (so the proportion is also zero), and the variable var2 has a zero in the second row (so the proportion is 1/3). I keep these values in a 2D numpy.array where the first column is the label of the dataframe column and the second column is the evaluated proportion.

In the function summarizing_data_users I want to do the same, but for each row. However, it is NOT working.

DanielTheRocketMan asked Mar 06 '16 16:03


2 Answers

Try this instead of the first function:

print(df[df == 0].count(axis=1)/len(df.columns))

UPDATE (correction):

print('rows')
print(df[df == 0].count(axis=1)/len(df.columns))
print('cols')
print(df[df == 0].count(axis=0)/len(df.index))

Input data (I've decided to add a few rows):

ID  var1  var2
1      2     3
2      5     0
3      4     5
4     10    10
5      1     0

Output:

rows
ID
1    0.0
2    0.5
3    0.0
4    0.0
5    0.5
dtype: float64
cols
var1    0.0
var2    0.4
dtype: float64
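For reference, here is a self-contained sketch that reproduces these numbers. It assumes ID has been set as the index, which is what the output above suggests but is not shown in the answer. It also uses the mean of the boolean frame `df == 0` as an alternative way to get the same proportions; the names `is_zero`, `row_prop` and `col_prop` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {'ID': [1, 2, 3, 4, 5],
     'var1': [2, 5, 4, 10, 1],
     'var2': [3, 0, 5, 10, 0]}
).set_index('ID')  # assumption: ID is the index, so it is not counted as a variable

# The approach from the answer: mask non-zero cells to NaN, then count what is left.
rows_by_count = df[df == 0].count(axis=1) / len(df.columns)
cols_by_count = df[df == 0].count(axis=0) / len(df.index)

# Alternative: the mean of a boolean frame is the fraction of True values.
is_zero = (df == 0)
row_prop = is_zero.mean(axis=1)   # proportion of zeros in each row
col_prop = is_zero.mean(axis=0)   # proportion of zeros in each column

print(row_prop)
print(col_prop)
```

Both variants are vectorized, so they avoid the Python-level loop over rows in the question, which matters for a large frame.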
MaxU - stop WAR against UA answered Oct 28 '22 12:10


My favorite way of getting the number of non-zeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

Note:

If you have NaNs in your df, you should set them to zero first; otherwise each NaN is counted as a non-zero value.

df.fillna(0).astype(bool).sum(axis=1)
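
Since the question asks for the proportion of zeros rather than the count of non-zeros, the counts above can be converted as in the sketch below. The small frame and the variable names are made up for illustration. Note that `fillna(0)` makes each NaN count as a zero, which may or may not be what you want:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame({'var1': [2, 5, np.nan], 'var2': [3, 0, 5]})

# Without fillna, NaN converts to True and is counted as a non-zero.
nonzero_raw = df.astype(bool).sum(axis=1)

# With fillna(0), NaN is treated as a zero.
nonzero_filled = df.fillna(0).astype(bool).sum(axis=1)

# Proportion of zeros per row = 1 - (non-zeros / number of columns).
zero_prop_per_row = 1 - nonzero_filled / df.shape[1]

print(nonzero_raw.tolist())
print(nonzero_filled.tolist())
print(zero_prop_per_row.tolist())
```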
Kevin Chou answered Oct 28 '22 12:10