I am using PCA to find out which variables in my dataset are redundant because they are highly correlated with other variables. I am using the MATLAB princomp function on data previously normalized using zscore:
[coeff, PC, eigenvalues] = princomp(zscore(x))
I know that the eigenvalues tell me how much of the dataset's variation each principal component covers, and that coeff tells me how much of the i-th original variable is in the j-th principal component (where i indexes rows and j indexes columns).
So I assumed that, to find out which variables of the original dataset are the most important and which are the least important, I should multiply the coeff matrix by the eigenvalues: the coeff values represent how much of every variable each component contains, and the eigenvalues tell how important each component is. This is my full code:
[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e
But this does not really show anything. I tried it on the following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):
v1 v2 v3
1 3 4
2 4 -1
4 6 9
3 5 -2
but the results of my calculations were the following:
v1 0.5525
v2 0.5525
v3 0.5264
and this does not really show anything. I would expect the result for variable 2 to show that it is far less important than v1 or v3. Which of my assumptions is wrong?
The input to PCA can be any set of numerical variables; however, they should be scaled relative to each other, and traditional PCA will not accept any missing data points. Data points are scored by how well they fit into a principal component (PC) based on a measure of variance within the dataset.
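A minimal sketch of those two requirements (the names xz, score, etc. are mine, not from this answer; x stands for the raw data matrix with observations in rows):

% traditional PCA cannot handle missing values, so check for NaNs first
assert(~any(isnan(x(:))), 'remove or impute missing data points before running PCA');
% put every variable on a comparable scale (zero mean, unit variance)
xz = zscore(x);
[coeff, score, eigenvalues] = princomp(xz);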
PCA is used to remove multicollinearity from the data. As far as I know, there is no point in removing correlated variables yourself: if there are correlated variables, PCA replaces them with a principal component that explains the maximum variance.
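As a small illustration of that replacement (a sketch with made-up data, not taken from this answer): when two variables are almost perfectly correlated, the first principal component absorbs nearly all of their variance.

v = randn(100, 1);
xc = [v, v + 0.01*randn(100, 1)];          % two strongly correlated variables
[coeff_c, score_c, ev] = princomp(zscore(xc));
ev(1) / sum(ev)                            % close to 1: the first component explains almost all the variance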
PCA is a technique that accounts for the variability of the system through linear combinations of independent variables; thus it should not include dependent variables.
EDIT: I have completely reworked the answer now that I understand which assumptions were wrong.
Before explaining what doesn't work in the OP, let me make sure we use the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data, i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up of multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.
If we have an array
x = [ 1 3 4
2 4 -1
4 6 9
3 5 -2];
that consists of four observations with three measurements each, then princomp(x) will find the lower-dimensional space spanned by the four observations. Since two of the measurements are co-dependent, one of the eigenvalues will be near zero: the space of measurements is effectively only 2D, not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are extremely obviously collinear:
coeff = princomp(x)
coeff =
0.10124 0.69982 0.70711
0.10124 0.69982 -0.70711
0.9897 -0.14317 1.1102e-16
Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].
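If you want to turn that inspection into code, one possible sketch (the threshold and the name nullIdx are arbitrary choices of mine) is to look at the eigenvectors that belong to the (numerically) zero eigenvalues; their non-zero entries mark the measurements involved in the linear dependence:

[coeff, score, eigenvalues] = princomp(x);
nullIdx = find(eigenvalues < 1e-10 * max(eigenvalues));  % components with essentially zero variance
coeff(:, nullIdx)       % here roughly [0.71; -0.71; 0], i.e. v1 and v2 are linearly dependent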
Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this (because in real life, measurements may not be perfectly collinear and we are interested in finding good vectors of descriptors for a machine-learning application), it makes a lot more sense to consider the three measurements as "observations" and run princomp(x'). Since there are thus only three "observations" but four "measurements", the fourth eigenvalue will be zero. However, since there are two linearly dependent observations, we're left with only two non-zero eigenvalues:
eigenvalues =
24.263
3.7368
0
0
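For reference, this is the call that produces those numbers (a sketch; coeffT and scoreT are my names, and x is the 4-by-3 array defined above; princomp treats rows as observations, so the transpose turns the three measurements into the observations):

[coeffT, scoreT, eigenvalues] = princomp(x');  % three "observations" with four "measurements" each
eigenvalues                                    % two non-zero values, then two (numerically) zero ones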
To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:
corr(x)
ans =
1 1 0.35675
1 1 0.35675
0.35675 0.35675 1
Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
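If you want to automate that last step, a small sketch (the 0.999 threshold and the variable names are my own, illustrative choices):

R = corr(x);                               % correlation between the measurements
[i, j] = find(triu(abs(R), 1) > 0.999);    % strongly correlated pairs (upper triangle, diagonal excluded)
redundantPairs = [i, j]                    % here: [1 2], i.e. v1 and v2 are redundant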
EDIT2
but the eigenvalues tell us which vectors in the new space are most important (cover the most variation), and the coefficients also tell us how much of each variable is in each component. So I assume we can use this data to find out which of the original variables hold most of the variance and are thus the most important (and get rid of those that represent only a small amount)
This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3; 1 4 22; 1 25 -25; 1 11 100], and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
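A quick way to verify this on the example matrix quoted above (a sketch; I call the matrix x2 and the output coeff2 to keep them separate from the earlier x and coeff, and I skip zscore here because the zero-variance column would turn into NaNs):

x2 = [1 2 3; 1 4 22; 1 25 -25; 1 11 100];
var(x2)                   % the first measurement has zero variance
coeff2 = princomp(x2)     % its loadings are (numerically) zero for every component with non-zero variance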
In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.
1. coeff returns the basis vectors of the principal components, and its order has little to do with the original input.
2. eigenvalues/sum(eigenvalues) gives the fraction of the total variance explained by each principal component.
3. Dividing abs(coeff) by those values does not make sense.
4. To find identical directions, run unique on normalized (i.e. norm equal to 1) vectors.