I'm trying to run the Shapiro test on each column of a pandas DataFrame, grouped by the column "code".
This is what my df looks like:
>>>name  code  2020-10-22  2020-10-23  2020-10-24  ...
0  a     1     0.05423     0.1254      0.1432
1  b     1     0.57289     0.0092      0.2314
2  c     2     0.1205      0.0072      0.12
3  d     3     0.3234      0.231       0.231
...
I have 80 rows with 6 different codes (1, 2, 3, 4, 5, 6).
I want to run the Shapiro test for each column and each code. For example: take the column 2020-10-22, keep only the rows with code 1, and run the Shapiro test on those values.
I have tried to do it using the following loop:
shapiros=[]
for variable in df.columns[2:]:
    tmp=df[['code',variable]]
    tmp=tmp[tmp[variable].notnull()]
    for i in tmp.code.unique().tolist():
        shapiro_test = stats.shapiro(tmp[tmp['code'] == i])
        shapiros.append(shapiro_test)
but then I get this error:
---> 13 shapiro_test = stats.shapiro(tmp[tmp['code'] == i])
TypeError: '<' not supported between instances of 'float' and 'str'
I saw that this error can occur when there are null values, but I have already removed those with notnull(). I checked that notnull() works by printing the length of "tmp" in each iteration, and it does change.
In addition, it seems that the type of both is the same, object:
for variable in df.columns[2:]:
    tmp=df[['code',variable]]
    print(tmp.dtypes)
    tmp=tmp[tmp[variable].notnull()]
    for i in tmp.code.unique().tolist():
        print(type(i))
>>>code object
2020-10-22 float64
dtype: object
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
...
(It prints the same for all the days.)
What could the problem be? How can I calculate the Shapiro test for each column, for each code?
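You have to convert the column code to float/int. As your output shows, it is currently str (dtype object) while the date columns are float64, and tmp[tmp['code'] == i] passes both columns to stats.shapiro, which is most likely where the float/str comparison fails. Try doing: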
df['code'] = df['code'].astype(float)
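Note that after the conversion the error goes away, but as far as I can tell the two-column frame passed to stats.shapiro would still include the code values in the test. A minimal sketch of an alternative, assuming the goal is one test per date column and per code: pass only the values of that column. The data below is synthetic and hypothetical, and the result column names ("column", "W", "p_value") are my own choice, not something from the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data in the shape described in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "name": [f"s{i}" for i in range(12)],
    "code": ["1"] * 6 + ["2"] * 6,  # codes kept as strings, as in the question
    "2020-10-22": rng.normal(size=12),
    "2020-10-23": rng.normal(size=12),
})
df.loc[0, "2020-10-22"] = np.nan  # one missing value, to exercise dropna()

rows = []
for variable in df.columns[2:]:
    # Group by code and take only the values of the current date column
    for code, values in df.groupby("code")[variable]:
        values = values.dropna()
        if len(values) < 3:  # stats.shapiro needs at least 3 observations
            continue
        statistic, p_value = stats.shapiro(values)
        rows.append({"column": variable, "code": code,
                     "W": statistic, "p_value": p_value})

shapiros = pd.DataFrame(rows)
print(shapiros)
```

With this approach the code column is only used for grouping, so it can stay as str and no conversion is needed.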