
How to correlate an Ordinal Categorical column in pandas?

I have a DataFrame df with a non-numerical column CatColumn.

              A         B    CatColumn
    0  381.1396  7.343921       Medium
    1  481.3268  6.786945       Medium
    2  263.3766  7.628746         High
    3  177.2400  5.225647  Medium-High

I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.

yousraHazem asked Dec 19 '17 20:12


People also ask

Can you do correlation with ordinal data?

The Pearson's correlation coefficient measures linear correlation between two continuous variables. Values obtained using an ordinal scale are NOT continuous but their corresponding ranks are. Hence, you can still use the Pearson's correlation coefficient on those ranks.

Can you run a correlation with a categorical variable?

Further, if either variable of the pair is categorical, we can't use the correlation coefficient. We will have to turn to other metrics. If x and y are both categorical, we can try Cramer's V or the phi coefficient.

How do you find the correlation between ordinal variables?

According to the book Research Methods for Business Students, the relationship between two ordinal variables can be assessed with Spearman's rank correlation coefficient (Spearman's rho) or Kendall's rank-order correlation coefficient (Kendall's tau).


1 Answer

I am going to strongly disagree with the other comments.

They miss the main point of correlation: how much does variable 1 increase or decrease as variable 2 increases or decreases? So, in the very first place, the order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of its values, the correlation will change completely. If you are building a tree-based model, this is a non-issue, but for a correlation analysis, special attention must be paid to preserving the order of an ordinal variable.

Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.

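    from io import StringIO
    import pandas as pd

    # The unnamed first column of the table is read as the index.
    rawText = StringIO("""
       A         B         C
    0  100.1396  1.343921  Medium
    1  105.3268  1.786945  Medium
    2  200.3766  9.628746  High
    3  150.2400  4.225647  Medium-High
    """)
    myData = pd.read_csv(rawText, sep=r"\s+")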

Notice: As C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong positive correlations for the pairs (C, A) and (C, B). Let's reproduce the two proposed answers:

    In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
    Out[226]:
              A         B         C
    A  1.000000  0.986493 -0.438466
    B  0.986493  1.000000 -0.579650
    C -0.438466 -0.579650  1.000000

Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?

What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:

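    In[227]:
    myData['C'] = myData['C'].astype('category')
    # Categories are sorted as [High, Medium, Medium-High]; relabel them as [2, 0, 1].
    # (Assigning to .cat.categories directly no longer works in pandas 2.x.)
    myData['C'] = myData['C'].cat.rename_categories([2, 0, 1])
    myData['C'] = myData['C'].astype('float')
    myData.corr()
    Out[227]:
              A         B         C
    A  1.000000  0.986493  0.998874
    B  0.986493  1.000000  0.982982
    C  0.998874  0.982982  1.000000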

Much better!
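As a possible alternative to relabeling the categories by hand, the order can be declared explicitly with an ordered categorical, so the integer codes follow the intended ranking instead of the alphabetical one. This is only a sketch: the names `df`, `levels` and `encoded` are mine, and the table is rebuilt inline from the values above so the snippet runs on its own.

```python
import pandas as pd

# Same values as the table in this answer, rebuilt inline.
df = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Declare the intended order, lowest to highest.
levels = ["Medium", "Medium-High", "High"]
ordered_c = pd.Categorical(df["C"], categories=levels, ordered=True)

# .codes now follows `levels`: Medium -> 0, Medium-High -> 1, High -> 2.
encoded = df.assign(C=ordered_c.codes)
corr = encoded.corr()
print(corr.round(6))
```

This should give the same positive correlations as Out[227], since the codes are again 0, 1, 2 in the intended order.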

Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramér's V and the like, or group the continuous variable by the nominal categories, etc. I don't think it would be right here, though.
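For the "group the continuous variable by the nominal categories" option in Note1, a minimal sketch could look like the following. It ignores the ordering of C entirely and just summarizes A and B per category; the names `df` and `summary` are mine, and the table is rebuilt inline from the values above.

```python
import pandas as pd

# Same values as the table in this answer, rebuilt inline.
df = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Treat C as nominal: per-category mean and row count of the numeric columns.
summary = df.groupby("C")[["A", "B"]].agg(["mean", "count"])
print(summary)
```

With so few rows per category this is only descriptive, but it shows the shape of the approach.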

Note2: If you had another category called Low, my answer could be criticized for assigning equally spaced numbers to unequally spaced categories. You could make the argument that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
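To illustrate Note2, here is a small sketch comparing equally spaced codes with the [2, 1, 1.5, 0] spacing. The Low row (A = 50.0) is made up purely for illustration and is not part of the original data; the names `a`, `c`, `equal_spacing` and `custom_spacing` are mine as well. The sketch also computes Spearman's rho as the Pearson correlation of the ranks, which depends only on the order of the codes and not on their spacing.

```python
import pandas as pd

# Column A from this answer plus one HYPOTHETICAL "Low" row (50.0 is invented).
a = pd.Series([100.1396, 105.3268, 200.3766, 150.2400, 50.0])
c = pd.Series(["Medium", "Medium", "High", "Medium-High", "Low"])

equal_spacing = {"Low": 0, "Medium": 1, "Medium-High": 2, "High": 3}
custom_spacing = {"Low": 0, "Medium": 1, "Medium-High": 1.5, "High": 2}  # as in Note2

c_equal = c.map(equal_spacing)
c_custom = c.map(custom_spacing)

# Pearson correlation is sensitive to the spacing chosen for the codes.
pearson_equal = a.corr(c_equal)
pearson_custom = a.corr(c_custom)

# Spearman's rho = Pearson correlation of the (average) ranks.
spearman_equal = a.rank().corr(c_equal.rank())
spearman_custom = a.rank().corr(c_custom.rank())

print(pearson_equal, pearson_custom)
print(spearman_equal, spearman_custom)
```

If the spacing between categories is hard to justify, a rank-based coefficient sidesteps the choice, at the cost of measuring monotonic rather than linear association.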

FatihAkici answered Dec 31 '22 22:12