Using pandas, calculate Cramér's coefficient matrix

I have a pandas dataframe which contains metrics calculated on Wikipedia articles. It has two categorical variables: nation, which is the nation the article is about, and lang, which is the language Wikipedia the article was taken from. For a single metric, I would like to see how closely the nation and language variables correlate. I believe this is done using Cramér's V statistic.

    index  qid       subj     nation  lang  metric           value
    5      Q3488399  economy  cdi     fr    informativeness  0.787117
    6      Q3488399  economy  cdi     fr    referencerate    0.000945
    7      Q3488399  economy  cdi     fr    completeness     43.200000
    8      Q3488399  economy  cdi     fr    numheadings      11.000000
    9      Q3488399  economy  cdi     fr    articlelength    3176.000000
    10     Q7195441  economy  cdi     en    informativeness  0.626570
    11     Q7195441  economy  cdi     en    referencerate    0.008610
    12     Q7195441  economy  cdi     en    completeness     6.400000
    13     Q7195441  economy  cdi     en    numheadings      7.000000
    14     Q7195441  economy  cdi     en    articlelength    2323.000000

I would like to generate a matrix that displays Cramér's coefficient between all combinations of the four nations (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and the three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:

           en        fr        sw
    usa    Cramer11  Cramer12  ...
    fra    Cramer21  Cramer22  ...
    cdi    ...
    uga    ...

Eventually I will do this over all the different metrics I am tracking.

    for subject in list_of_subjects:
        for metric in list_of_metrics:
            cramer_matrix(metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks


notconfusing


2 Answers

Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a bias-corrected version.

    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """ calculate Cramers V statistic for categorical-categorical association.
            uses the bias correction from Wicher Bergsma,
            Journal of the Korean Statistical Society 42 (2013): 323-328
        """
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        # total number of observations; np.asarray makes this a scalar
        # whether a DataFrame (e.g. from pd.crosstab) or an array is passed
        n = np.asarray(confusion_matrix).sum()
        phi2 = chi2 / n
        r, k = confusion_matrix.shape
        # bias-corrected phi^2 and bias-corrected table dimensions
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

Also note that the confusion matrix for two categorical columns can be calculated with the built-in pandas function crosstab:

    import pandas as pd
    confusion_matrix = pd.crosstab(df[column1], df[column2])
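As a quick sanity check, here is a minimal sketch that puts the two pieces together on made-up data. The column names mirror the question, but the dataframes `df` and `df_indep` and their values are hypothetical, and the function is a compact restatement of the one above so the snippet runs on its own. A perfectly associated pair of columns should give a value of about 1, and a pair with no association should give 0.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramér's V for a contingency table."""
    observed = np.asarray(confusion_matrix)
    chi2 = ss.chi2_contingency(observed)[0]
    n = observed.sum()
    phi2 = chi2 / n
    r, k = observed.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Hypothetical data: every nation appears with exactly one language.
df = pd.DataFrame({
    "nation": ["fra"] * 30 + ["usa"] * 30 + ["uga"] * 30,
    "lang": ["fr"] * 30 + ["en"] * 30 + ["sw"] * 30,
})
perfect = cramers_corrected_stat(pd.crosstab(df["nation"], df["lang"]))

# Hypothetical data: every nation is spread evenly over all three languages.
df_indep = pd.DataFrame({
    "nation": ["fra"] * 30 + ["usa"] * 30 + ["uga"] * 30,
    "lang": ["fr", "en", "sw"] * 30,
})
independent = cramers_corrected_stat(
    pd.crosstab(df_indep["nation"], df_indep["lang"])
)

print(perfect, independent)
```

Note that this yields one number per pair of columns, so to compare metrics you would first need to decide which two categorical columns (or binned metric values) go into the crosstab.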

Ziggy Eunicien


A slightly modified version of the function from Ziggy Eunicien's answer, with two modifications:

  1. checking if one of the variables is constant

  2. passing a correction flag to ss.chi2_contingency(conf_matrix, correction=correct), which is set to False if the confusion matrix is 2x2

    import scipy.stats as ss
    import pandas as pd
    import numpy as np

    def cramers_corrected_stat(x, y):
        """ calculate Cramers V statistic for categorical-categorical association.
            uses the bias correction from Wicher Bergsma,
            Journal of the Korean Statistical Society 42 (2013): 323-328
        """
        result = -1  # returned as-is when either variable is constant
        if len(x.value_counts()) == 1:
            print("First variable is constant")
        elif len(y.value_counts()) == 1:
            print("Second variable is constant")
        else:
            conf_matrix = pd.crosstab(x, y)

            # turn off the continuity correction when the table has two rows
            if conf_matrix.shape[0] == 2:
                correct = False
            else:
                correct = True

            chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

            n = sum(conf_matrix.sum())  # total number of observations
            phi2 = chi2 / n
            r, k = conf_matrix.shape
            phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
            rcorr = r - ((r-1)**2)/(n-1)
            kcorr = k - ((k-1)**2)/(n-1)
            result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
        return round(result, 6)
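To illustrate why the correction flag only matters for 2x2 tables, here is a small sketch with made-up counts (`table_2x2` and `table_3x3` are hypothetical). As far as I understand the SciPy documentation, `chi2_contingency` applies Yates' continuity correction only when the table has one degree of freedom, so on a 2x2 table the flag lowers the chi-squared statistic, while on a larger table it changes nothing.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical 2x2 table of counts (one degree of freedom).
table_2x2 = np.array([[12, 5],
                      [6, 14]])
chi2_yates = ss.chi2_contingency(table_2x2, correction=True)[0]
chi2_plain = ss.chi2_contingency(table_2x2, correction=False)[0]

# Hypothetical 3x3 table of counts (four degrees of freedom).
table_3x3 = np.array([[10, 2, 3],
                      [4, 12, 5],
                      [3, 4, 11]])
chi2_3x3_on = ss.chi2_contingency(table_3x3, correction=True)[0]
chi2_3x3_off = ss.chi2_contingency(table_3x3, correction=False)[0]

print(chi2_yates, chi2_plain)      # the corrected statistic is the smaller one
print(chi2_3x3_on, chi2_3x3_off)   # the flag has no effect here
```

Since a smaller chi-squared gives a smaller V, leaving the correction on for 2x2 tables would pull the statistic down on top of the bias correction, which is presumably why it is switched off in that case.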

Yury Wallet
