
Can we generate a contingency table for a chi-square test using Python?

I am using the scipy.stats.chi2_contingency method to get chi-square statistics. It takes a frequency table, i.e. a contingency table, as its parameter. But I have a feature vector and want to generate the frequency table automatically. Is there a function available for this? I am currently doing it like this:

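import numpy as np
from scipy.stats import chi2_contingency

def contigency_matrix_categorical(data_series, target_series, target_val, indicator_val):
    # Count the rows for every (target value, indicator value) pair.
    observed_freq = {}
    for targets in target_val:
        observed_freq[targets] = {}
        for indicators in indicator_val:
            mask = (target_series == targets) & (data_series == indicators['val'])
            observed_freq[targets][indicators['val']] = data_series[mask].count()
    # Flatten the nested dict into a list while tracking the table shape.
    f_obs = []
    var1 = 0  # number of rows (target values)
    var2 = 0  # number of columns (indicator values)
    for i in observed_freq:
        var1 = var1 + 1
        var2 = 0
        for j in observed_freq[i]:
            # 5 is added to every cell count, so the tested table is shifted.
            f_obs.append(observed_freq[i][j] + 5)
            var2 = var2 + 1
    arr = np.array(f_obs).reshape(var1, var2)
    c, p, dof, expected = chi2_contingency(arr)
    return {'score': c, 'pval': p, 'dof': dof}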

Here data_series and target_series are the column values, target_val lists the target values, and indicator_val is a list of dicts whose 'val' entries are the indicator values. Can anyone help? Thanks.

asked Jul 15 '14 at 20:07 by icm




1 Answer

You can use pandas.crosstab to generate a contingency table from a DataFrame. From the documentation:

Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

Below is a usage example:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Some fake data.
n = 5  # Number of samples.
d = 3  # Dimensionality.
c = 2  # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])

# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])

# Chi-square test of independence.
# The statistic is named chi2 so it does not overwrite c (the number of categories).
chi2, p, dof, expected = chi2_contingency(contingency)

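The data are random, so the result varies from run to run. The data table for one draw and the contingency table generated from it were shown as images in the original answer and are not reproduced here.

For that draw, scipy.stats.chi2_contingency(contingency) returns (0.052, 0.819, 1, array([[1.6, 0.4], [2.4, 0.6]])): the chi-square statistic, the p-value, the degrees of freedom, and the array of expected frequencies.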
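Applied to the setup in the question (one feature column and one target column), the same idea might be wrapped up as in the sketch below. The helper name chi2_from_series and the sample values are made up for illustration; only pd.crosstab and chi2_contingency come from the libraries. The last lines check that the expected frequencies returned by SciPy equal (row total * column total) / grand total.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_from_series(data_series, target_series):
    """Cross-tabulate two categorical Series and run the chi-square test.

    Hypothetical helper for illustration; not part of pandas or SciPy.
    """
    # Rows are the target values, columns are the feature values.
    table = pd.crosstab(target_series, data_series)
    chi2, p, dof, expected = chi2_contingency(table)
    return {'table': table, 'score': chi2, 'pval': p, 'dof': dof,
            'expected': expected}


# Small made-up sample: one categorical feature and one binary target.
feature = pd.Series(['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a'], name='feature')
target = pd.Series([0, 1, 0, 0, 1, 1, 1, 0], name='target')

result = chi2_from_series(feature, target)
print(result['table'])
print(result['score'], result['pval'], result['dof'])

# Expected frequencies are (row total * column total) / grand total.
observed = result['table'].to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / observed.sum()
print(np.allclose(manual_expected, result['expected']))
```

Unlike the function in the question, this sketch passes the raw counts to chi2_contingency without adding a constant to each cell, and the category values are discovered by pd.crosstab rather than passed in explicitly.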

answered Sep 20 '22 at 22:09 by mdeff