Principal Component Analysis Tutorial - Convert R code to Matlab issues

Tags: r, matlab, pca

I am trying to understand PCA by finding practical examples online. Sadly most tutorials I have found don't really seem to show simple practical applications of PCA. After a lot of searching, I came across this

http://yatani.jp/HCIstats/PCA

It is a nice, simple tutorial. I want to re-create its results in Matlab, but the tutorial is in R. I have been trying to replicate the results in Matlab but have so far been unsuccessful; I am new to Matlab. I have created the arrays as follows:

Price = [6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2];
Software = [5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3];
Aesthetics = [3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7];
Brand = [4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7];

Then in his example, he does this

data <- data.frame(Price, Software, Aesthetics, Brand)

I did a quick search online, and this apparently combines the vectors into a data frame (R's table-like structure, one column per variable). So in Matlab I did this:

dataTable(:,1) = Price;
dataTable(:,2) = Software;
dataTable(:,3) = Aesthetics;
dataTable(:,4) = Brand;

Now it is the next part I am unsure of.

pca <- princomp(data, cor=TRUE)
summary(pca, loadings=TRUE)

I have tried using Matlab's PCA function

 [COEFF SCORE LATENT] = princomp(dataTable)

But my results do not match the ones shown in the tutorial at all. My results are

COEFF =

   -0.5958    0.3786    0.7065   -0.0511
   -0.1085    0.8343   -0.5402   -0.0210
    0.6053    0.2675    0.3179   -0.6789
    0.5166    0.2985    0.3287    0.7321


SCORE =

   -2.3362    0.0276    0.6113    0.4237
   -4.3534   -2.1268    1.4228   -0.3707
   -1.1057   -0.2406    1.7981    0.4979
   -3.6847    0.4840   -2.1400    1.0586
   -1.4218    2.9083    1.2020   -0.2952
   -3.3495   -1.3726    0.5049    0.3916
   -4.1126    0.1546   -2.4795   -1.0846
   -1.7309    0.2951    0.9293   -0.2552
    2.8169    0.5898    0.4318    0.7366
    3.7976   -2.1655   -0.2402   -1.2622
    3.3041    1.0454   -0.8148    0.7667
    1.4969    2.9845    0.7537   -0.8187
    2.3993   -1.1891   -0.3811    0.7556
    1.7836   -0.0072   -0.2255   -0.7276
    2.2613   -0.1977   -2.4966    0.0326
    4.2350   -1.1899    1.1236    0.1509


LATENT =

    9.3241
    2.2117
    1.8727
    0.5124 

Yet the results in the tutorial are

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5589391 0.9804092 0.6816673 0.37925777
Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
Cumulative Proportion  0.6075727 0.8478733 0.9640409 1.00000000

Loadings:
           Comp.1 Comp.2 Comp.3 Comp.4
Price      -0.523         0.848       
Software   -0.177  0.977 -0.120       
Aesthetics  0.597  0.134  0.295 -0.734
Brand       0.583  0.167  0.423  0.674

Could anyone please explain why my results differ so much from the tutorial? Am I using the wrong Matlab function?

Also, if you can point me to any other simple, practical applications of PCA, that would be very helpful. I am still trying to get my head around the concepts in PCA, and I like examples that I can code and run myself so I can play around with them; I find it easier to learn this way.

Any help would be much appreciated!!

AdamM asked Oct 03 '22 17:10


1 Answer

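Edit: The issue is purely the scaling. The tutorial calls princomp with cor=TRUE, which runs the PCA on the correlation matrix (i.e., on standardized variables), whereas Matlab's princomp centers the data but does not rescale it. With cor = FALSE, R gives the same loadings as Matlab (up to sign).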

R code:

summary(princomp(data, cor = FALSE), loadings=T, cutoff = 0.01)

Loadings:
           Comp.1 Comp.2 Comp.3 Comp.4
Price      -0.596 -0.379  0.706 -0.051
Software   -0.109 -0.834 -0.540 -0.021
Aesthetics  0.605 -0.268  0.318 -0.679
Brand       0.517 -0.298  0.329  0.732

According to the Matlab help you should use this if you want scaling:

Matlab code:

princomp(zscore(X))
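Applied to the data from the question, that fix could look like the sketch below. It assumes the Statistics Toolbox is installed and a release that provides pca (the newer replacement for princomp; on older releases princomp(zscore(dataTable)) takes the same input). The variable names sdev, propVar and cumPropVar are illustrative choices, not part of the tutorial.

```matlab
% Data from the question, stored as column vectors (one row per observation)
Price      = [6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2]';
Software   = [5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3]';
Aesthetics = [3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7]';
Brand      = [4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7]';
dataTable  = [Price, Software, Aesthetics, Brand];

% Standardize every column (zero mean, unit variance), then run the PCA.
% This is equivalent to a PCA on the correlation matrix (cor=TRUE in R).
[coeff, score, latent] = pca(zscore(dataTable));

sdev       = sqrt(latent);          % compare with "Standard deviation" in R
propVar    = latent / sum(latent);  % compare with "Proportion of Variance"
cumPropVar = cumsum(propVar);       % compare with "Cumulative Proportion"

disp(coeff)                         % loadings; columns may differ in sign from R
disp([sdev, propVar, cumPropVar])
```

The first entry of sdev should come out close to the 1.5589 reported in the tutorial. Individual columns of coeff may have the opposite sign to R's loadings, which does not change the components.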

Old answer (a red herring):

From help(princomp) (in R):

The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.

Note that the default calculation uses divisor N for the covariance matrix.

In the documentation of the R function prcomp (help(prcomp)) you can read:

The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. [...] Unlike princomp, variances are computed with the usual divisor N - 1.

The Matlab function apparently uses the svd algorithm. If I use prcomp (without scaling, i.e., not based on correlations) with the example data, I get:

> prcomp(data)
Standard deviations:
[1] 3.0535362 1.4871803 1.3684570 0.7158006

Rotation:
                  PC1       PC2        PC3         PC4
Price      -0.5957661 0.3786184 -0.7064672  0.05113761
Software   -0.1085472 0.8342628  0.5401678  0.02101742
Aesthetics  0.6053008 0.2675111 -0.3179391  0.67894297
Brand       0.5166152 0.2984819 -0.3286908 -0.73210631

This is (apart from the irrelevant signs) identical to the Matlab output. Note that Matlab's LATENT holds variances, so the standard deviations above are its square roots.
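The points quoted from the R help (eigen versus svd, divisor N versus N - 1) can be checked with core Matlab functions only. This is a minimal sketch with illustrative variable names, not the way either R function is actually implemented:

```matlab
% Data from the question: one row per observation, one column per variable.
X = [6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2;    % Price
     5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3;    % Software
     3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7;    % Aesthetics
     4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7]';  % Brand
n = size(X, 1);

% Route 1: eigenvalues of the covariance matrix (cov uses divisor N - 1).
latentEig = sort(eig(cov(X)), 'descend');

% Route 2: singular values of the centered data matrix.
Xc = X - repmat(mean(X, 1), n, 1);
s  = svd(Xc);
latentSvd = s.^2 / (n - 1);

% Both routes give the same component variances (Matlab's LATENT).
disp([latentEig, latentSvd])

sdevPrcomp   = sqrt(latentEig);                % divisor N - 1, as in prcomp
sdevPrincomp = sqrt(latentEig * (n - 1) / n);  % divisor N, as in princomp(cor = FALSE)
disp([sdevPrcomp, sdevPrincomp])
```

The sdevPrcomp column should reproduce the standard deviations printed by prcomp(data) above, while sdevPrincomp is slightly smaller because of the divisor N. On standardized data the divisor cancels out, which is why it plays no role in the cor=TRUE results.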

Roland answered Oct 13 '22 11:10