I have the code below. I am surprised that it works for the columns but not for the rows.
import numpy as np
import pandas as pd


def summarizing_data_variables(df):
    # Proportion of zeros in each column.
    numberRows = np.size(df['ID'])
    numberColumns = np.size(df.columns)
    summaryVariables = np.empty([numberColumns, 2], dtype=np.dtype('a50'))
    cont = -1
    for column in df.columns:
        cont = cont + 1
        summaryVariables[cont][0] = column
        summaryVariables[cont][1] = np.size(df[df[column].isin([0])][column]) / (1.0 * numberRows)
    print(summaryVariables)


def summarizing_data_users(df):
    # Proportion of zeros in each row (this is the part that does not work).
    print("Sumarizing users...")
    numberRows = np.size(df['ID'])
    numberColumns = np.size(df.columns)
    summaryVariables = np.empty([numberRows, 2], dtype=np.dtype('a50'))
    cont = -1
    for row in df['ID']:
        cont = cont + 1
        summaryVariables[cont][0] = row
        dft = df[df['ID'] == row]
        # The -1 is used to not count the ID column
        proportionZeros = (np.size(dft[dft.isin([0])]) - 1) / (1.0 * (numberColumns - 1))
        summaryVariables[cont][1] = proportionZeros
    print(summaryVariables)


if __name__ == '__main__':
    df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]])
    df.columns = ['ID', 'var1', 'var2']
    print(df)
    summarizing_data_variables(df)
    summarizing_data_users(df)
The output is this:
ID var1 var2
0 1 2 3
1 2 5 0
2 3 4 5
[['ID' '0.0']
['var1' '0.0']
['var2' '0.333333333333']]
Sumarizing users...
[['1' '1.0']
['2' '1.0']
['3' '1.0']]
For the users, I was expecting this:
Sumarizing users...
[['1' '0.0']
['2' '0.5']
['3' '0.0']]
It seems that the problem is in this line:
dft[dft.isin([0])]
It does not constrain dft to the "True" values like in the first case.
Can you help me with this? (1) How to correct the users (ROWS) part (second function above)? (2) Is this the most efficient method to do this? [My database is very big]
EDIT:
In the function summarizing_data_variables(df) I try to evaluate the proportion of zeros in each column. In the example above, the variable ID has no zeros (so the proportion is zero), the variable var1 has no zeros (so the proportion is also zero), and the variable var2 has a zero in the second row (so the proportion is 1/3). I keep these values in a 2D numpy.array where the first column is the label of the dataframe column and the second column is the evaluated proportion.
In the function summarizing_data_users I want to do the same thing for each row. However, it is NOT working.
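To make the difference between the two cases concrete, here is a minimal sketch (written for Python 3 and a recent pandas, which is an assumption; the variable names are only illustrative). Indexing with a boolean Series selects rows, so the result shrinks. Indexing with a boolean DataFrame keeps the original shape and replaces the non-matching cells with NaN, so its size never changes.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]],
                  columns=["ID", "var1", "var2"])

# Boolean Series mask: selects rows, so the result shrinks.
col_mask = df["var2"].isin([0])
rows_with_zero = df[col_mask]
n_filtered = len(rows_with_zero)      # only the rows where var2 is 0 remain

# Boolean DataFrame mask: same shape, non-matching cells become NaN.
dft = df[df["ID"] == 2]
masked = dft[dft.isin([0])]
shape_after_mask = masked.shape       # still one row and three columns

# Summing the True cells of the mask counts the zeros directly.
n_zeros = int(dft.isin([0]).sum().sum())

print(n_filtered, shape_after_mask, n_zeros)
```

This is why the row version always reports (3 - 1) / 2 = 1.0: the size of the masked frame is the number of cells, not the number of zeros.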
In Excel, select a blank cell, type the formula =COUNTIF(A1:H8,0) into it, and press Enter; all cells containing zero are counted, and blank cells are excluded. Tip: In the above formula, A1:H8 is the data range you want to count the zeros in; change it as needed.
In pandas, dataframe.index represents the rows and dataframe.columns represents the columns, so len(dataframe.index) and len(dataframe.columns) give the number of rows and columns respectively.
You can calculate a percentage of a total in pandas with groupby() and the DataFrame.transform() method. The transform() method lets you apply a function to the values of the DataFrame. If the percentage is computed directly on the summarized DataFrame, the results are calculated using all the data.
Try this instead of the first function:
print(df[df == 0].count(axis=1)/len(df.columns))
UPDATE (correction):
print('rows')
print(df[df == 0].count(axis=1)/len(df.columns))
print('cols')
print(df[df == 0].count(axis=0)/len(df.index))
Input data (I've decided to add a few rows):
ID var1 var2
1 2 3
2 5 0
3 4 5
4 10 10
5 1 0
Output:
rows
ID
1 0.0
2 0.5
3 0.0
4 0.0
5 0.5
dtype: float64
cols
var1 0.0
var2 0.4
dtype: float64
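As a variation on the same idea, the proportion can be taken as the mean of the boolean mask, which avoids the separate division. This is only a sketch: it assumes ID is the index (as the output above suggests), and the names is_zero, row_share and col_share are made up for illustration.

```python
import pandas as pd

# Same data as above, with ID as the index (an assumption based on the output).
df = pd.DataFrame(
    {"var1": [2, 5, 4, 10, 1], "var2": [3, 0, 5, 10, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

is_zero = df == 0                     # boolean DataFrame, True where a cell is 0

row_share = is_zero.mean(axis=1)      # fraction of zeros in each row
col_share = is_zero.mean(axis=0)      # fraction of zeros in each column

print(row_share)
print(col_share)
```

Because the whole computation is vectorized, there is no Python-level loop over rows, which matters on a large table.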
My favorite way of getting the number of non-zeros in each column is
df.astype(bool).sum(axis=0)
For the number of non-zeros in each row use
df.astype(bool).sum(axis=1)
Notice:
If you have NaNs in your df, you should replace them with zero first; otherwise each NaN is converted to True and counted as a non-zero.
df.fillna(0).astype(bool).sum(axis=1)
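A small sketch of the NaN caveat, using a made-up two-column frame (the data and the names naive and cleaned are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, 0.0, np.nan], "b": [0.0, 0.0, 2.0]})

# NaN converts to True, so the last row is over-counted.
naive = df.astype(bool).sum(axis=1)

# Replacing NaN with 0 first counts only the real non-zero values.
cleaned = df.fillna(0).astype(bool).sum(axis=1)

print(naive.tolist())
print(cleaned.tolist())
```

Whether a missing value should count as a zero depends on the data, so fillna(0) is a choice rather than a requirement.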