Using pandas, calculate Cramér's coefficient matrix

I have a pandas DataFrame containing metrics calculated on Wikipedia articles. It has two categorical variables: nation, the nation the article is about, and lang, the language edition of Wikipedia the article was taken from. For a single metric, I would like to see how closely the nation and lang variables are associated. I believe this is done using Cramér's V statistic.

    index  qid       subj     nation  lang  metric           value
    5      Q3488399  economy  cdi     fr    informativeness  0.787117
    6      Q3488399  economy  cdi     fr    referencerate    0.000945
    7      Q3488399  economy  cdi     fr    completeness     43.200000
    8      Q3488399  economy  cdi     fr    numheadings      11.000000
    9      Q3488399  economy  cdi     fr    articlelength    3176.000000
    10     Q7195441  economy  cdi     en    informativeness  0.626570
    11     Q7195441  economy  cdi     en    referencerate    0.008610
    12     Q7195441  economy  cdi     en    completeness     6.400000
    13     Q7195441  economy  cdi     en    numheadings      7.000000
    14     Q7195441  economy  cdi     en    articlelength    2323.000000

I would like to generate a matrix that displays Cramér's coefficient between all combinations of four nations (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and three languages ['fr','en','sw']. The result would be a 4 by 3 matrix like:

           en        fr        sw
    usa    Cramer11  Cramer12  ...
    fra    Cramer21  Cramer22  ...
    cdi    ...
    uga    ...

Eventually I will do this for all the different metrics I am tracking:

    for subject in list_of_subjects:
        for metric in list_of_metrics:
            cramer_matrix(metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks

475

asked Jan 02 '14 22:01

notconfusing

2 Answers

Cramér's V seemed pretty overoptimistic in a few tests that I did. Wikipedia recommends a bias-corrected version.

    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """Calculate Cramér's V statistic for categorical-categorical association.

        Uses the bias correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        # Accept either a NumPy array or a pandas crosstab; on a DataFrame,
        # .sum() would return per-column sums instead of the total count.
        confusion_matrix = np.asarray(confusion_matrix)
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum()
        phi2 = chi2 / n
        r, k = confusion_matrix.shape
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
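Read as formulas, the function above computes the following, where χ² is the chi-squared statistic, n is the total count, and r and k are the number of rows and columns of the confusion matrix. This is only a transcription of the code's arithmetic, not a quotation from the cited paper, and the tilde notation is just a label chosen here.

```latex
\begin{align}
\varphi^2 &= \frac{\chi^2}{n} \\
\tilde{\varphi}^2 &= \max\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right) \\
\tilde{r} &= r - \frac{(r-1)^2}{n-1}, \qquad \tilde{k} = k - \frac{(k-1)^2}{n-1} \\
\tilde{V} &= \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\end{align}
```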

Also note that the confusion matrix for two categorical columns can be calculated with a built-in pandas function:

    import pandas as pd
    confusion_matrix = pd.crosstab(df[column1], df[column2])
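As a rough end-to-end sketch of how the two pieces fit together, assuming a DataFrame with nation and lang columns like the one in the question: the toy data below is randomly generated (hypothetical, not the asker's data), and cramers_v is a helper name introduced here for the uncorrected statistic, purely for comparison with the corrected one.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v(table):
    """Uncorrected Cramér's V (hypothetical helper, for comparison only)."""
    table = np.asarray(table)
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, k - 1)))


def cramers_corrected_stat(table):
    """Bias-corrected Cramér's V, same arithmetic as the function above."""
    table = np.asarray(table)
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum()
    phi2 = chi2 / n
    r, k = table.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Toy data: nation and lang are drawn independently of each other,
# so there is no real association between them.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "nation": rng.choice(["fra", "usa", "uga"], size=300),
    "lang": rng.choice(["fr", "en", "sw"], size=300),
})

table = pd.crosstab(df["nation"], df["lang"])  # 3x3 table of counts
v_plain = cramers_v(table)
v_corr = cramers_corrected_stat(table)

print(table)
print(f"uncorrected V = {v_plain:.4f}, corrected V = {v_corr:.4f}")
```

On independent columns like these, the uncorrected value comes out above zero from sampling noise alone, while the corrected value is pulled toward zero. That is the overoptimism mentioned at the top of this answer.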

182

answered Oct 10 '22 11:10

Ziggy Eunicien

A slightly modified version of the function from Ziggy Eunicien's answer, with two modifications:

1. It checks whether one of the variables is constant.

2. It calls ss.chi2_contingency(conf_matrix, correction=correct), with correct set to False if the confusion matrix is 2x2.

    import numpy as np
    import pandas as pd
    import scipy.stats as ss

    def cramers_corrected_stat(x, y):
        """Calculate Cramér's V statistic for categorical-categorical association.

        Uses the bias correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        Returns -1 if either variable is constant.
        """
        result = -1
        if len(x.value_counts()) == 1:
            print("First variable is constant")
        elif len(y.value_counts()) == 1:
            print("Second variable is constant")
        else:
            conf_matrix = pd.crosstab(x, y)

            # Turn off the Yates continuity correction for two-row tables
            # (SciPy only applies it when the table is 2x2 anyway).
            correct = conf_matrix.shape[0] != 2

            chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

            n = conf_matrix.to_numpy().sum()  # total number of observations
            phi2 = chi2 / n
            r, k = conf_matrix.shape
            phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
            rcorr = r - ((r - 1) ** 2) / (n - 1)
            kcorr = k - ((k - 1) ** 2) / (n - 1)
            result = np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
        return round(result, 6)

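To see what the two modifications react to, here is a small sketch. The 2x2 table and the Series are made up for illustration, and the variable names are mine. It shows how the correction flag changes the chi-squared statistic on a 2x2 table, and how value_counts() flags a constant variable.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical 2x2 contingency table, chosen only for illustration.
# Every expected count is 15, so |observed - expected| is 5 in each cell.
table = np.array([[10, 20],
                  [20, 10]])

# Default (correction=True): with one degree of freedom SciPy applies the
# Yates continuity correction, moving each cell 0.5 toward its expectation.
chi2_yates = ss.chi2_contingency(table)[0]            # about 5.4

# correction=False returns the plain Pearson chi-squared statistic.
chi2_plain = ss.chi2_contingency(table, correction=False)[0]  # about 6.67

print(chi2_yates, chi2_plain)

# Constant-variable check: a single distinct value means no association
# can be measured, so the function above returns -1 in this case.
constant = pd.Series(["fr"] * 5)
is_constant = len(constant.value_counts()) == 1
print(is_constant)  # True
```

Because Cramér's V is derived from chi-squared, leaving the Yates correction on shrinks V for 2x2 tables, which appears to be the reason the function disables it there.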
answered Oct 10 '22 11:10

Yury Wallet

Tags: python, pandas, statistics