I am using the scipy.stats.chi2_contingency method to get chi-square statistics. It expects a frequency table (i.e. a contingency table) as its parameter, but I have a feature vector and want to generate the frequency table automatically. Is there a function available for this? I am currently doing it like this:
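import numpy as np
from scipy.stats import chi2_contingency

def contigency_matrix_categorical(data_series,target_series,target_val,indicator_val):
    # Count the rows for every (target, indicator) combination.
    observed_freq={}
    for targets in target_val:
        observed_freq[targets]={}
        for indicators in indicator_val:
            observed_freq[targets][indicators['val']]=data_series[((target_series==targets)&(data_series==indicators['val']))].count()
    # Flatten the nested dict into a list, tracking the table shape.
    f_obs=[]
    var1=0
    var2=0
    for i in observed_freq:
        var1=var1+1
        var2=0
        for j in observed_freq[i]:
            # Note: 5 is added to every cell count here.
            f_obs.append(observed_freq[i][j]+5)
            var2=var2+1
    # var1 is the number of rows (targets), var2 the number of columns (indicators).
    arr=np.array(f_obs).reshape(var1,var2)
    c,p,dof,expected=chi2_contingency(arr)
    return {'score':c,'pval':p,'dof':dof}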
Here data_series and target_series are the column values, and the other two arguments list the target values and the indicators (each indicator is a dict with a 'val' key). Can anyone help? Thanks.
Contingency tables are constructed by listing all the levels of one variable as rows in a table and the levels of the other variables as columns, then finding the joint or cell frequency for each cell. The cell frequencies are then summed across both rows and columns.
When analysis of categorical data is concerned with more than one variable, two-way tables (also known as contingency tables) are employed.
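As a small illustration of the description above, the following sketch builds a two-way table with pandas.crosstab and asks for the row and column totals via margins=True. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data: two categorical variables observed on six subjects.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "gender": ["F", "M", "F", "M", "M", "M"],
})

# Rows are the levels of one variable, columns the levels of the other.
# margins=True appends the row and column totals under the label "All".
table = pd.crosstab(df["smoker"], df["gender"], margins=True)
print(table)
```

The totals are handy for reading the table, but they should be left out (the default, margins=False) when the table is passed to a chi-square test, since the test expects only the cell frequencies.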
You can use pandas.crosstab to generate a contingency table from a DataFrame. From the documentation:
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
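To make the two modes in that description concrete, here is a minimal sketch with made-up data: the first call returns plain frequencies, the second passes values and aggfunc so that each cell holds an aggregate instead of a count.

```python
import pandas as pd

# Hypothetical sales records; the column names are illustrative only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "product": ["a", "b", "a", "a", "b"],
    "units": [10, 5, 3, 7, 2],
})

# Default behaviour: count how often each (region, product) pair occurs.
counts = pd.crosstab(df["region"], df["product"])

# With values and aggfunc: sum the units for each (region, product) pair.
totals = pd.crosstab(df["region"], df["product"],
                     values=df["units"], aggfunc="sum")

print(counts)
print(totals)
```

For a chi-square test of independence it is the default frequency table that is needed, not the aggregated one.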
Below is a usage example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
# Some fake data.
n = 5 # Number of samples.
d = 3 # Dimensionality.
c = 2 # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])
# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])
# Chi-square test of independence.
chi2, p, dof, expected = chi2_contingency(contingency)  # chi2 rather than c, which already holds the number of categories
The example generates a data table and, from it, a contingency table (neither table is reproduced here).
Then, scipy.stats.chi2_contingency(contingency) returns (0.052, 0.819, 1, array([[1.6, 0.4],[2.4, 0.6]])).
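Applied to the question, the manual counting loop could be replaced by a small wrapper along these lines. This is a sketch, not a drop-in replacement: the function name chi2_from_series is made up, and the five-row sample is hypothetical data chosen to be consistent with the expected frequencies quoted above.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_from_series(data_series, target_series):
    """Cross-tabulate two categorical Series and run the chi-square test."""
    # crosstab counts every (target, feature) combination present in the data.
    table = pd.crosstab(target_series, data_series)
    score, pval, dof, expected = chi2_contingency(table.to_numpy())
    return {'score': score, 'pval': pval, 'dof': dof, 'table': table}


# Hypothetical sample (assumption): five rows, two binary columns.
cat1 = pd.Series([0, 0, 1, 1, 1], name='CAT1')
cat2 = pd.Series([0, 1, 0, 0, 0], name='CAT2')

result = chi2_from_series(cat2, cat1)
print(result['table'])
print(round(result['score'], 3), round(result['pval'], 3), result['dof'])
```

With this sample the statistic is about 0.052 and the p-value about 0.819, in line with the values quoted above. Unlike the code in the question, the sketch does not add 5 to each cell, and crosstab only creates rows and columns for levels that actually occur in the data.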