I have no experience of note with Python, and am trying to use it for a statistical analysis of a very large dataset (10 million cases) because the other options (SPSS and R) are unable to handle the dataset on the authorized hardware.
In this dataset, there are many categorical variables (Diagnosis1, Diagnosis2...Diagnosis30) and an Event variable (the dependent variable).
Cases are listed as rows.
Something like this
Diagnosis1 Diagnosis2 Diagnosis3 Event
1 0 0 1
0 1 0 0
0 1 0 0
....and so on
I can load the data and review it with this -
import pandas as pd
import numpy as np
NRD_Data = pd.read_csv('NRD_DL.csv')
NRD_Data.head()
but I am stuck on how to build 2x2 tables like the one below and perform a chi-square test on them.
         Diagnosis1=1  Diagnosis1=0
Event=1  100           12
Event=0  80            45
Something akin to running cross-tabs in SPSS to compare categorical values is the desired result.
Use pd.crosstab to get the matrix you need; then you can run your chi-square test on it.
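import pandas as pd
# df is the DataFrame loaded from the CSV (NRD_Data in the question)
l = ['Diagnosis1', 'Diagnosis2', 'Diagnosis3']
d = []
for i in l:
    # one Event x DiagnosisN contingency table per column
    d.append(pd.crosstab(df['Event'], df[i]))
d[0]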
Out[569]:
Diagnosis1 0 1
Event
0 2 0
1 0 1
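The crosstab gives the table but not the test itself. One way to finish the job, sketched here on the assumption that SciPy is installed (it is not mentioned in the question), is to pass each table to scipy.stats.chi2_contingency. The toy DataFrame and the helper name chi_square_by_column are made up for illustration; with the real data you would pass the DataFrame read from the CSV instead.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for the real file; the values are invented.
df = pd.DataFrame({
    "Diagnosis1": [1, 1, 1, 0, 0, 0, 1, 0],
    "Diagnosis2": [0, 1, 0, 1, 1, 0, 0, 1],
    "Event":      [1, 1, 0, 0, 0, 1, 1, 0],
})

def chi_square_by_column(frame, columns, outcome="Event"):
    """Cross-tabulate `outcome` against each column and run a chi-square test.

    Hypothetical helper, not part of pandas or SciPy.
    """
    rows = []
    for col in columns:
        # Contingency table: outcome levels as rows, diagnosis levels as columns
        table = pd.crosstab(frame[outcome], frame[col])
        # For 2x2 tables SciPy applies Yates' continuity correction by default
        chi2, p, dof, expected = chi2_contingency(table)
        rows.append({
            "variable": col,
            "chi2": chi2,
            "p_value": p,
            "dof": dof,
            # Smallest expected count, useful for checking test assumptions
            "min_expected": expected.min(),
        })
    return pd.DataFrame(rows).set_index("variable")

results = chi_square_by_column(df, ["Diagnosis1", "Diagnosis2"])
print(results)
```

If you want the uncorrected statistic for a 2x2 table, chi2_contingency accepts correction=False. With 30 diagnosis columns you are running 30 tests, so some adjustment for multiple comparisons is probably worth considering.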