
PCA For categorical features?

My understanding was that PCA can only be performed on continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across the following post:

When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?

It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is being applied to categorical features. This confuses me, so please advise.

data_person asked Nov 24 '16 22:11



2 Answers

I disagree with the others.

While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good idea, or that it will work very well.

PCA is designed for continuous variables. It tries to find the directions that maximize variance (= squared deviations), or equivalently that minimize the squared reconstruction error. The concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA. And yes, you get an output. It is even a least-squares output: it's not as if PCA would segfault on such data. It works, but it is just much less meaningful than you'd want it to be, and supposedly less meaningful than e.g. frequent pattern mining.
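To make the "you get an output" point concrete, here is a minimal sketch (not from the linked post) that one-hot encodes a toy column and runs plain PCA on it through an SVD. The data and the helper names `one_hot` and `pca` are made up for illustration, and only NumPy is assumed.

```python
import numpy as np

def one_hot(labels):
    """Return (indicator matrix, sorted category list) for a list of labels."""
    cats = sorted(set(labels))
    index = {c: j for j, c in enumerate(cats)}
    Z = np.zeros((len(labels), len(cats)))
    for i, lab in enumerate(labels):
        Z[i, index[lab]] = 1.0
    return Z, cats

def pca(X, n_components):
    """Plain PCA via SVD of the column-centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # projected rows
    explained = s**2 / np.sum(s**2)                   # variance ratio per axis
    return scores, Vt[:n_components], explained

# Hypothetical toy data: one categorical feature with three levels.
colors = ["red", "green", "blue", "green", "red", "red", "blue", "green"]
Z, cats = one_hot(colors)
scores, components, explained = pca(Z, n_components=2)

print(cats)
print(np.round(explained, 3))
print(np.round(scores, 3))
```

It runs without complaint, but look at what comes out: every row with the same category gets exactly the same scores, and the last variance ratio is numerically zero, because the indicator columns of one variable always sum to 1. The projection is a valid least-squares fit, yet it mostly re-expresses the category frequencies.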

Has QUIT--Anony-Mousse answered Oct 05 '22 21:10


MCA (multiple correspondence analysis) is a well-known technique for dimensionality reduction of categorical data. In R there are many packages for MCA, and some even combine it with PCA for mixed data. In Python there is an mca library too. MCA applies maths similar to PCA; indeed, the French statistician used to say, "data analysis is to find correct matrix to diagonalize".

http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
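As a rough illustration of the "find the right matrix to diagonalize" idea, here is a minimal sketch of MCA done as a correspondence analysis of the indicator matrix, using only NumPy. It is not the API of the Python mca library or of any R package; the function names `indicator_matrix` and `mca` and the toy data are assumptions for the example.

```python
import numpy as np

def indicator_matrix(columns):
    """Stack the one-hot blocks of several categorical columns side by side."""
    blocks = []
    for col in columns:
        cats = sorted(set(col))
        blocks.append(np.array([[1.0 if v == c else 0.0 for c in cats]
                                for v in col]))
    return np.hstack(blocks)

def mca(columns, n_components=2):
    """MCA as correspondence analysis of the indicator matrix."""
    Z = indicator_matrix(columns)
    P = Z / Z.sum()                     # correspondence matrix
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    # Standardized residuals: this is the matrix that gets "diagonalized".
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the rows (observations).
    row_coords = (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]
    return row_coords, s**2             # s**2 are the principal inertias

# Hypothetical toy data: two categorical variables, 3 and 2 levels.
color = ["red", "green", "blue", "green", "red", "red", "blue", "green"]
size = ["S", "L", "S", "S", "L", "S", "L", "L"]
row_coords, inertias = mca([color, size])

print(np.round(row_coords, 3))
print(np.round(inertias, 3))
```

The difference from PCA is in the weighting: the residuals are scaled by row and column masses (a chi-square metric) instead of being treated as plain squared deviations. A quick sanity check is that the inertias sum to J/Q - 1, where J is the total number of categories and Q the number of variables, so 5/2 - 1 = 1.5 for this toy data.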

joscani answered Oct 05 '22 20:10