Scikit-learn principal component analysis (PCA) for dimension reduction

I want to perform principal component analysis for dimension reduction and data integration.

I have 3 features (variables) and 5 samples, as shown below. I want to integrate them into a 1-dimensional (1-feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe it displays the 'main' characteristics of the 3 input features.

I first wrote a test code in Python using scikit-learn, shown below. It is the simple case where the values of the 3 features are all equal. In other words, I applied PCA to three copies of the same vector, [0, 1, 2, 1, 0].

Code

import numpy as np
from sklearn.decomposition import PCA

# 5 samples x 3 features; every feature holds the same vector [0, 1, 2, 1, 0]
samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# keep only the first principal component
pca = PCA(n_components=1)
pc1 = pca.fit_transform(samples)
print(pc1)

Output

[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
  1. Is taking the 1st PC for dimension reduction a proper approach for data integration?

1-2. For example, suppose the features are [power rank, speed rank] and power has a roughly negative correlation with speed, in a 2-feature case. I want to find the sample that has both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs. [power 3, speed 3]. So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC (a sketch of this idea appears after this list). Is this kind of approach still proper?

  2. In this case, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?
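A minimal sketch of the ranking idea from question 1-2, using a made-up [power rank, speed rank] dataset (the numbers are purely illustrative, and whether the 1st-PC order really reflects 'overall goodness' is exactly what the question asks):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical rank data: lower rank = better; the two columns are
# roughly negatively correlated, as described in the question.
ranks = np.array([[1, 5], [2, 4], [4, 2], [3, 3], [5, 1]])

pca = PCA(n_components=1)
pc1 = pca.fit_transform(ranks).ravel()

# Order the samples by their score on the 1st PC.
order = np.argsort(pc1)
print(pc1)
print(order)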
asked Oct 12 '17 by z991


1 Answer

  1. Yes. It is also called data projection (to a lower dimension).
  2. The result is correct. fit_transform centers the data (subtracts the per-feature mean) and then projects it onto the unit-length first principal component. Here that component is (1, 1, 1)/√3, so each centered value is effectively scaled by √3: [0, 1, 2, 1, 0] has mean 0.8, and √3 · (0 − 0.8) ≈ −1.3856, which matches your output.
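A short check of point 2, reusing the question's data (standard NumPy and scikit-learn only):

import numpy as np
from sklearn.decomposition import PCA

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])
pc1 = PCA(n_components=1).fit_transform(samples)

# Reproduce the output by hand: center one feature vector, then scale by
# sqrt(3), which is the effect of projecting (c, c, c) onto (1, 1, 1)/sqrt(3).
# (In general the sign of a principal component is arbitrary; here it matches.)
x = np.array([0, 1, 2, 1, 0])
manual = np.sqrt(3) * (x - x.mean())

print(pc1.ravel())  # [-1.3856  0.3464  2.0785  0.3464 -1.3856]
print(manual)       # same values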

With only 5 samples I don't think it is wise to run any statistical method. And if you believe that your features are the same, just check that the correlation between the dimensions is close to 1; then you can simply disregard all but one of them (see the sketch below).
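A minimal way to run that correlation check on the question's samples array (with real data you would inspect the off-diagonal entries):

import numpy as np

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# Correlation matrix of the 3 feature columns (rowvar=False treats columns
# as variables). Off-diagonal values near 1 mean the features are redundant.
print(np.corrcoef(samples, rowvar=False))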

answered Oct 14 '22 by igrinis