How to correlate an Ordinal Categorical column in pandas?

I have a DataFrame df with a non-numerical column CatColumn.

              A         B    CatColumn
    0  381.1396  7.343921       Medium
    1  481.3268  6.786945       Medium
    2  263.3766  7.628746         High
    3  177.2400  5.225647  Medium-High

I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.

490

asked Dec 19 '17 20:12

yousraHazem

1 Answer

I am going to strongly disagree with the other comments.

They miss the main point of correlation: how much does variable 1 increase or decrease as variable 2 increases or decreases? So, first and foremost, the order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of the categories, the correlation will change completely. If you are building a tree-based model, this is a non-issue, but for a correlation analysis, special attention must be paid to preserving the order of an ordinal variable.

Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.

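    from io import StringIO

    import pandas as pd

    rawText = StringIO("""
       A         B         C
    0  100.1396  1.343921  Medium
    1  105.3268  1.786945  Medium
    2  200.3766  9.628746  High
    3  150.2400  4.225647  Medium-High
    """)
    # Raw string so that \s is passed through as a regex, not a string escape.
    myData = pd.read_csv(rawText, sep=r"\s+")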

Notice: As C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong positive correlations for the pairs (C, A) and (C, B). Let's reproduce the two proposed answers:

    In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
    Out[226]:
              A         B         C
    A  1.000000  0.986493 -0.438466
    B  0.986493  1.000000 -0.579650
    C -0.438466 -0.579650  1.000000

Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?

What is going on is that C is factorized according to the alphabetical sort order of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], so the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve the ordering and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:

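    In[227]: myData['C'] = myData['C'].astype('category')
             # Map each category to its rank. Direct assignment to
             # .cat.categories is no longer supported in recent pandas versions.
             myData['C'] = myData['C'].cat.rename_categories(
                 {'High': 2, 'Medium': 0, 'Medium-High': 1})
             myData['C'] = myData['C'].astype('float')
             myData.corr()
    Out[227]:
              A         B         C
    A  1.000000  0.986493  0.998874
    B  0.986493  1.000000  0.982982
    C  0.998874  0.982982  1.000000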

Much better!
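Another way to get the same encoding, as a minimal sketch: declare the order once with an ordered categorical dtype, so that `cat.codes` follow it rather than alphabetical order. The order Medium < Medium-High < High and the names `data` and `level_order` are assumptions made for this illustration.

```python
import pandas as pd

# Same values as the table above, built directly for a self-contained example.
data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Assumed order, lowest to highest: Medium < Medium-High < High.
level_order = pd.CategoricalDtype(
    categories=["Medium", "Medium-High", "High"], ordered=True
)

# With an explicit dtype, the codes follow the declared category order
# (Medium -> 0, Medium-High -> 1, High -> 2), not the alphabetical one.
data["C"] = data["C"].astype(level_order).cat.codes

corr = data.corr()
print(corr)
```

The codes here are the same [0, 0, 2, 1] as in the manual assignment, so the correlations should match the table above.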

Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramér's V and the like, or group the continuous variable by the nominal categories, etc. I don't think it would be right here, though.
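For reference, the grouping option from Note1 could look roughly like this. It is only a sketch of one nominal-style summary (per-category means), using the same example values; the names `data` and `group_means` are made up for the illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Treat C as a plain label and summarize the numeric columns per category.
# No order is assumed here, so nothing is said about direction of change.
group_means = data.groupby("C")[["A", "B"]].mean()
print(group_means)
```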

Note2: If you had another category called Low, my answer could be criticized on the grounds that I assigned equally spaced numbers to unequally spaced categories. You could make the argument that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
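To see how much the spacing matters, here is a small sketch comparing two order-preserving encodings. Both mappings are hypothetical choices for illustration, not recommended values. The rank-based (Spearman-style) figure is computed as the Pearson correlation of ranks, which depends only on the order and so is unaffected by the spacing.

```python
import pandas as pd

data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Two hypothetical encodings; both keep Medium < Medium-High < High.
encodings = {
    "equal": {"Medium": 0, "Medium-High": 1, "High": 2},
    "uneven": {"Medium": 0, "Medium-High": 1, "High": 5},
}

results = {}
for name, mapping in encodings.items():
    encoded = data["C"].map(mapping)
    results[name] = {
        # Pearson uses the actual numeric distances between levels.
        "pearson": data["A"].corr(encoded),
        # Pearson on ranks: only the order of the levels matters.
        "rank_based": data["A"].rank().corr(encoded.rank()),
    }

print(pd.DataFrame(results))
```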

175

answered Dec 31 '22 22:12

FatihAkici
