I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing the labels from the training data, I append each CSV row to a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
import numpy as np
from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)

    # fit PCA on the training data, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)
I then create a kNN classifier, fit it with the X_train data, and make predictions on the X_test data.
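Roughly like this (a minimal sketch; train_labels stands for the labels stripped from the training rows earlier, and the classifier parameters are left at their defaults):

from sklearn.neighbors import KNeighborsClassifier

# reduce both sets to pca_components dimensions
X_train, X_test = preprocess(train_data, test_data)

# fit kNN on the reduced training data and predict the reduced test data
knn = KNeighborsClassifier()
knn.fit(X_train, train_labels)
predictions = knn.predict(X_test)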
Using this method I can get around 97% accuracy.
My question is about the dimensionality of the data before and after PCA is performed.
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?
PCA is most useful on datasets with three or more dimensions, because as the number of dimensions grows it becomes increasingly difficult to interpret the resulting cloud of data directly. PCA is applied to datasets with numeric variables.
Dimensionality is the number of features or variables present in a dataset; more simply, it is the number of columns. Correlation signifies how strongly two variables are related to each other.
PCA is affected by scale, so you should scale the features in your data before applying it. Use StandardScaler from scikit-learn to standardize the features onto unit scale (mean = 0, standard deviation = 1), which is a requirement for the optimal performance of many machine learning algorithms.
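A minimal sketch of that step (variable names follow the question's code):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(train_data)
train_scaled = scaler.transform(train_data)
test_scaled = scaler.transform(test_data)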
PCA helps us identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and project it onto a new subspace with equal or fewer dimensions than the original one.
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
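You can see this directly with numpy on a small toy dataset (a sketch, not the digit data):

import numpy as np

# toy data: 200 samples, 3 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# eigendecomposition of the covariance matrix (np.cov centers the data)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh returns ascending order; reverse so the largest-variance
# direction (the first principal component) comes first
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])      # variance along each principal direction
print(eigenvectors[:, order])  # columns are the orthogonal basis vectors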
The pca_components parameter tells the algorithm how many of those basis vectors you are interested in. So if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
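After fitting, you can check how much of the total variance the chosen components actually retain (using the pca object from the preprocess function above):

# fraction of total variance explained by each component,
# and the fraction the 100 components retain together
print(pca.explained_variance_ratio_[:5])
print(pca.explained_variance_ratio_.sum())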
The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
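In scikit-learn that projection is explicit: with whiten=False (the default), transform subtracts the fitted mean and takes dot products with the chosen eigenvectors:

# rows of pca.components_ are the chosen eigenvectors;
# transform(X) is centering followed by a projection onto them
manual = (np.asarray(train_data) - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(train_data)))  # True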
For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, the 3D point cloud would first be rotated so that the most variance lies parallel to the coordinate axes. Then the axis where the variance is smallest is discarded, leaving you with 2D data.
So, to answer your question directly: yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
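For the digit-recognizer data specifically (assuming all 42000 training rows with their 784 pixel columns were loaded), the shapes work out like this:

print(np.shape(train_data))  # (42000, 784) - 784 input dimensions
print(X_train.shape)         # (42000, 100) - 100 dimensions after PCA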