I am using PCA to find out which variables in my dataset are redundant because they are highly correlated with other variables. I am using the MATLAB princomp function on data previously normalized using zscore:
[coeff, PC, eigenvalues] = princomp(zscore(x))
I know that the eigenvalues tell me how much of the dataset's variation each principal component covers, and that coeff tells me how much of the i-th original variable is in the j-th principal component (where i indexes rows and j indexes columns).
So I assumed that, to find out which variables of the original dataset are the most important and which are the least important, I should multiply the coeff matrix by the eigenvalues: the coeff values represent how much of every variable each component contains, and the eigenvalues tell how important each component is. This is my full code:
[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e
But this does not really show anything. I tried it on the following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):
v1 v2 v3
1 3 4
2 4 -1
4 6 9
3 5 -2
but the results of my calculations were the following:
v1 0.5525
v2 0.5525
v3 0.5264
and this does not really show anything. I would expect the result for variable 2 to show that it is far less important than v1 or v3. Which of my assumptions is wrong?
The input to PCA can be any set of numerical variables; however, they should be scaled relative to each other, and traditional PCA will not accept any missing data points. Data points are scored by how well they fit into a principal component (PC) based on a measure of variance within the dataset.
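A minimal sketch of those two requirements (the names xz, score, etc. are mine, not from this answer; x stands for the raw data matrix with observations in rows):

% traditional PCA cannot handle missing values, so check for NaNs first
assert(~any(isnan(x(:))), 'remove or impute missing data points before running PCA');
% put every variable on a comparable scale (zero mean, unit variance)
xz = zscore(x);
[coeff, score, eigenvalues] = princomp(xz);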
PCA is used to remove multicollinearity from the data. As far as I know, there is no point in removing correlated variables yourself: if there are correlated variables, PCA replaces them with a principal component that explains the maximum variance.
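As a small illustration of that replacement (a sketch with made-up data, not taken from this answer): when two variables are almost perfectly correlated, the first principal component absorbs nearly all of their variance.

v = randn(100, 1);
xc = [v, v + 0.01*randn(100, 1)];          % two strongly correlated variables
[coeff_c, score_c, ev] = princomp(zscore(xc));
ev(1) / sum(ev)                            % close to 1: the first component explains almost all the variance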
PCA is a technique that accounts for the variability of the system through linear combinations of independent variables; thus it should not include dependent variables.
EDIT: I have completely reworked the answer now that I understand which assumptions were wrong.
Before explaining what doesn't work in the OP, let me make sure we use the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data, i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up of multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.
If we have an array
x = [ 1 3 4
2 4 -1
4 6 9
3 5 -2];
that consists of four observations with three measurements each, then princomp(x) will find the lower-dimensional space spanned by the four observations. Since two of the measurements are co-dependent, one of the eigenvalues will be near zero: the space of measurements is effectively only 2D, not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are extremely obviously collinear:
coeff = princomp(x)
coeff =
0.10124 0.69982 0.70711
0.10124 0.69982 -0.70711
0.9897 -0.14317 1.1102e-16
Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].
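If you want to turn that inspection into code, one possible sketch (the threshold and the name nullIdx are arbitrary choices of mine) is to look at the eigenvectors that belong to the (numerically) zero eigenvalues; their non-zero entries mark the measurements involved in the linear dependence:

[coeff, score, eigenvalues] = princomp(x);
nullIdx = find(eigenvalues < 1e-10 * max(eigenvalues));  % components with essentially zero variance
coeff(:, nullIdx)       % here roughly [0.71; -0.71; 0], i.e. v1 and v2 are linearly dependent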
Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this (because in real life, measurements may not be perfectly collinear and we are interested in finding good vectors of descriptors for a machine-learning application), it makes a lot more sense to consider the three measurements as "observations" and run princomp(x'). Since there are thus only three "observations" but four "measurements", the fourth eigenvalue will be zero. However, since there are two linearly dependent observations, we're left with only two non-zero eigenvalues:
eigenvalues =
24.263
3.7368
0
0
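For reference, this is the call that produces those numbers (a sketch; coeffT and scoreT are my names, and x is the 4-by-3 array defined above; princomp treats rows as observations, so the transpose turns the three measurements into the observations):

[coeffT, scoreT, eigenvalues] = princomp(x');  % three "observations" with four "measurements" each
eigenvalues                                    % two non-zero values, then two (numerically) zero ones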
To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:
corr(x)
ans =
1 1 0.35675
1 1 0.35675
0.35675 0.35675 1
Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
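If you want to automate that last step, a small sketch (the 0.999 threshold and the variable names are my own, illustrative choices):

R = corr(x);                               % correlation between the measurements
[i, j] = find(triu(abs(R), 1) > 0.999);    % strongly correlated pairs (upper triangle, diagonal excluded)
redundantPairs = [i, j]                    % here: [1 2], i.e. v1 and v2 are redundant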
EDIT2
but the eigenvalues tell us which vectors in the new space are most important (cover the most variation), and the coefficients also tell us how much of each variable is in each component. So I assume we can use this data to find out which of the original variables hold most of the variance and are thus the most important (and get rid of those that represent only a small amount)
This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3; 1 4 22; 1 25 -25; 1 11 100], and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
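A quick way to verify this on the example matrix quoted above (a sketch; I call the matrix x2 and the output coeff2 to keep them separate from the earlier x and coeff, and I skip zscore here because the zero-variance column would turn into NaNs):

x2 = [1 2 3; 1 4 22; 1 25 -25; 1 11 100];
var(x2)                   % the first measurement has zero variance
coeff2 = princomp(x2)     % its loadings are (numerically) zero for every component with non-zero variance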
In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.
1. coeff returns the basis vectors of the principal components, and its order has little to do with the original input.
2. eigenvalues/sum(eigenvalues) gives the fraction of the total variance explained by each principal component.
3. Dividing abs(coeff) by those values does not make sense.
4. To find identical directions, run unique on normalized (i.e. norm equal to 1) vectors.