
Python: Chi Squared for categorical values in large dataset

I have no experience of note with Python, and am trying to use it for a statistical analysis of a very large dataset (10 million cases) because the other options (SPSS and R) are unable to handle the dataset on the authorized hardware.

In this dataset, there are many categorical variables (Diagnosis1, Diagnosis2...Diagnosis30) and an Event variable (the dependent variable).
Cases are listed as rows.

Something like this

Diagnosis1       Diagnosis2         Diagnosis3   Event
1                0                  0            1
0                1                  0            0 
0                1                  0            0 

....and so on

I can load the data and review it with this -

    import pandas as pd
    import numpy as np
    NRD_Data = pd.read_csv('NRD_DL.csv')
    NRD_Data.head()

but I am stuck on how to build 2x2 tables and perform a Chi Square test on the tables.

            Diagnosis1=1   Diagnosis1=0
Event=1     100            12
Event=0     80             45

The desired result is something akin to running cross-tabs in SPSS to compare categorical values.

RROBINSON asked Oct 18 '22 03:10


1 Answer

Use pd.crosstab to build the contingency table you need for each diagnosis; you can then run your Chi Square test on each table.

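# df is the loaded DataFrame (NRD_Data in the question)
cols = ['Diagnosis1', 'Diagnosis2', 'Diagnosis3']
tables = []
for col in cols:
    # Cross-tabulate Event (rows) against the diagnosis flag (columns)
    tables.append(pd.crosstab(df['Event'], df[col]))
tables[0]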
Out[569]: 
Diagnosis1  0  1
Event           
0           2  0
1           0  1
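
To go from the tables to the test itself, one option is scipy.stats.chi2_contingency, which accepts a crosstab directly. The sketch below is only an illustration under stated assumptions: the toy DataFrame stands in for the real file, and the helper name chi2_by_column and the result column names are made up for this example. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default (correction=True).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for the real file (hypothetical values).
df = pd.DataFrame({
    'Diagnosis1': [1, 0, 0, 1, 1, 0, 1, 0],
    'Diagnosis2': [0, 1, 1, 0, 1, 0, 0, 1],
    'Diagnosis3': [0, 0, 1, 1, 0, 1, 0, 1],
    'Event':      [1, 0, 0, 1, 1, 0, 1, 0],
})

def chi2_by_column(data, columns, outcome='Event'):
    """Hypothetical helper: run one chi-square test per categorical column."""
    rows = []
    for col in columns:
        # Contingency table: outcome levels as rows, column levels as columns
        table = pd.crosstab(data[outcome], data[col])
        # Returns the statistic, p-value, degrees of freedom, expected counts
        chi2, p, dof, expected = chi2_contingency(table)
        rows.append({
            'variable': col,
            'chi2': chi2,
            'p_value': p,
            'dof': dof,
            # Small expected counts are a warning sign for the approximation
            'min_expected': expected.min(),
        })
    return pd.DataFrame(rows).set_index('variable')

results = chi2_by_column(df, ['Diagnosis1', 'Diagnosis2', 'Diagnosis3'])
print(results)
```

With 30 diagnosis columns you would pass the full list of column names. Because that means 30 separate tests, it may be worth thinking about a multiple-comparison adjustment when reading the p-values.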
BENY answered Oct 23 '22 07:10