I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. It has two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and lang variables correlate; I believe this is done using Cramér's statistic.
index        qid     subj  nation  lang           metric        value
5       Q3488399  economy     cdi    fr  informativeness     0.787117
6       Q3488399  economy     cdi    fr    referencerate     0.000945
7       Q3488399  economy     cdi    fr     completeness    43.200000
8       Q3488399  economy     cdi    fr      numheadings    11.000000
9       Q3488399  economy     cdi    fr    articlelength  3176.000000
10      Q7195441  economy     cdi    en  informativeness     0.626570
11      Q7195441  economy     cdi    en    referencerate     0.008610
12      Q7195441  economy     cdi    en     completeness     6.400000
13      Q7195441  economy     cdi    en      numheadings     7.000000
14      Q7195441  economy     cdi    en    articlelength  2323.000000
I would like to generate a matrix that displays Cramér's coefficient between all combinations of the four nations (France, USA, Côte d'Ivoire, and Uganda: ['fra', 'usa', 'cdi', 'uga']) and the three languages ['fr', 'en', 'sw']. So there would be a resulting 4-by-3 matrix like:
       en        fr        sw
usa    Cramer11  Cramer12  ...
fra    Cramer21  Cramer22  ...
cdi    ...
uga    ...
Eventually I will do this over all the different metrics I am tracking:
for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)
Then I can test my hypothesis that metrics will be higher for articles whose nation's language matches the language of the Wikipedia edition they come from. Thanks
Pandas makes it easy to find the correlation coefficient between columns: calling the .corr() method on a DataFrame returns a correlation matrix showing the coefficient between each pair of columns. Similarly, NumPy's corrcoef() function returns a 2×2 matrix consisting of the correlations of x with x (0,0), x with y (0,1), y with x (1,0), and y with y (1,1). Note, however, that both of these measure numeric correlation; they will not give you Cramér's V for categorical variables like nation and lang, so that has to be computed separately.
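For reference, a minimal sketch of the numeric case (the small example frame here is made up for illustration):

import numpy as np
import pandas as pd

df_num = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 5, 9]})
print(df_num.corr())                           # pairwise Pearson correlation of the numeric columns
print(np.corrcoef(df_num["x"], df_num["y"]))   # 2x2 matrix: [[1, r], [r, 1]]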
Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a bias-corrected version.
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorial-categorial association.
    Uses correction from Bergsma and Wicher, Journal of the Korean
    Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # total observations (on a crosstab, .sum() alone returns column sums)
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))  # bias-corrected phi^2
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
Also note that the confusion matrix can be computed for categorical columns with a built-in pandas function:
import pandas as pd

confusion_matrix = pd.crosstab(df[column1], df[column2])
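For example, with the question's DataFrame, the association between nation and lang for a single metric could be computed along these lines (a sketch assuming df has the columns shown in the question and that cramers_corrected_stat is defined as above):

sub = df[df["metric"] == "informativeness"]                 # look at one metric at a time
confusion_matrix = pd.crosstab(sub["nation"], sub["lang"])
print(cramers_corrected_stat(confusion_matrix))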
A slightly modified version of the function from Ziggy Eunicien's answer. Two modifications added:
1. checking whether one of the variables is constant
2. passing correction=False to ss.chi2_contingency(conf_matrix, correction=correct) when the confusion matrix is 2x2
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_corrected_stat(x, y):
    """Calculate Cramér's V statistic for categorial-categorial association.
    Uses correction from Bergsma and Wicher, Journal of the Korean
    Statistical Society 42 (2013): 323-328.
    """
    result = -1
    if len(x.value_counts()) == 1:
        print("First variable is constant")
    elif len(y.value_counts()) == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)
        # Yates' continuity correction is only appropriate for 2x2 tables
        if conf_matrix.shape[0] == 2:
            correct = False
        else:
            correct = True
        chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]
        n = sum(conf_matrix.sum())  # total number of observations
        phi2 = chi2 / n
        r, k = conf_matrix.shape
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        result = np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
    return round(result, 6)