Scikit-learn principal component analysis (PCA) for dimension reduction

I want to perform principal component analysis for dimension reduction and data integration.

I have 3 features (variables) and 5 samples, as shown below. I want to integrate them into a 1-dimensional (1-feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe it displays the 'main' characteristics of the 3 input features.

I first wrote a test code in Python using scikit-learn, shown below. It is the simple case where the values of the 3 features are all equal. In other words, I applied PCA to three copies of the same vector, [0, 1, 2, 1, 0].

Code

import numpy as np
from sklearn.decomposition import PCA

# 5 samples x 3 features; every feature holds the same vector [0, 1, 2, 1, 0]
samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# keep only the first principal component
pca = PCA(n_components=1)
pc1 = pca.fit_transform(samples)
print(pc1)

Output

[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
  1. Is taking the 1st PC for dimension reduction a proper approach for data integration?

1-2. For example, suppose the features are [power rank, speed rank] and power has a roughly negative correlation with speed, in a 2-feature case. I want to find the sample that has both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs. [power 3, speed 3]. So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC (a sketch of this idea appears after this list). Is this kind of approach still proper?

  2. In this case, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?
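A minimal sketch of the ranking idea from question 1-2, using a made-up [power rank, speed rank] dataset (the numbers are purely illustrative, and whether the 1st-PC order really reflects 'overall goodness' is exactly what the question asks):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical rank data: lower rank = better; the two columns are
# roughly negatively correlated, as described in the question.
ranks = np.array([[1, 5], [2, 4], [4, 2], [3, 3], [5, 1]])

pca = PCA(n_components=1)
pc1 = pca.fit_transform(ranks).ravel()

# Order the samples by their score on the 1st PC.
order = np.argsort(pc1)
print(pc1)
print(order)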
asked Oct 12 '17 by z991


1 Answer

  1. Yes. It is also called data projection (to a lower dimension).
  2. The result is correct. fit_transform centers the data (subtracts the per-feature mean) and then projects it onto the unit-length first principal component. Here that component is (1, 1, 1)/√3, so each centered value is effectively scaled by √3: [0, 1, 2, 1, 0] has mean 0.8, and √3 · (0 − 0.8) ≈ −1.3856, which matches your output.
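A short check of point 2, reusing the question's data (standard NumPy and scikit-learn only):

import numpy as np
from sklearn.decomposition import PCA

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])
pc1 = PCA(n_components=1).fit_transform(samples)

# Reproduce the output by hand: center one feature vector, then scale by
# sqrt(3), which is the effect of projecting (c, c, c) onto (1, 1, 1)/sqrt(3).
# (In general the sign of a principal component is arbitrary; here it matches.)
x = np.array([0, 1, 2, 1, 0])
manual = np.sqrt(3) * (x - x.mean())

print(pc1.ravel())  # [-1.3856  0.3464  2.0785  0.3464 -1.3856]
print(manual)       # same values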

With only 5 samples I don't think it is wise to run any statistical method. And if you believe that your features are the same, just check that the correlation between the dimensions is close to 1; then you can simply disregard all but one of them (see the sketch below).
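A minimal way to run that correlation check on the question's samples array (with real data you would inspect the off-diagonal entries):

import numpy as np

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# Correlation matrix of the 3 feature columns (rowvar=False treats columns
# as variables). Off-diagonal values near 1 mean the features are redundant.
print(np.corrcoef(samples, rowvar=False))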

answered Oct 14 '22 by igrinis