 

Dimension of data before and after performing PCA

I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.

After removing the labels from the training data, I append each row of the CSV to a list like this:

import numpy as np

for row in csv:  # csv: an iterable over rows of the training file
    train_data.append(np.array(row, dtype=np.int64))

I do the same for the test data.

I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):

from sklearn import decomposition
import numpy as np

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)

    # fit PCA on the training data, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)

I then create a kNN classifier, fit it with the X_train data, and make predictions on the X_test data.
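
For reference, a minimal sketch of that step (assuming scikit-learn's KNeighborsClassifier with default settings, and a train_labels list holding the labels removed earlier):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()      # default n_neighbors=5
knn.fit(X_train, train_labels)    # train_labels: the labels stripped from the CSV
predictions = knn.predict(X_test)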

Using this method I can get around 97% accuracy.

My question is about the dimensionality of the data before and after PCA is performed:

What are the dimensions of train_data and X_train?

How does the number of components influence the dimensionality of the output? Are they the same thing?

asked Nov 15 '13 by jmz


People also ask

What is the right number of dimensions for PCA for data preparation?

PCA works best on data sets with three or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resulting cloud of data. PCA is applied to data sets with numeric variables.

What is dimension in PCA?

Dimensionality: the number of features or variables present in the given dataset; put simply, the number of columns in the dataset. Correlation: signifies how strongly two variables are related to each other.
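
For instance, a toy sketch with a hypothetical 3-row data set:

import numpy as np

X = np.array([[1, 2], [2, 4], [3, 6]])     # 3 rows, 2 columns
print(X.shape[1])                          # 2 -> the dimensionality
print(np.corrcoef(X, rowvar=False)[0, 1])  # 1.0 -> the columns are perfectly correlated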

Should you scale data before PCA?

PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use StandardScaler from scikit-learn to standardize the dataset's features to unit scale (mean = 0 and standard deviation = 1), which is a requirement for the optimal performance of many machine learning algorithms.
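
A minimal sketch of that recipe, assuming a numeric feature matrix X:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X)              # mean = 0, std = 1 per feature
X_reduced = PCA(n_components=100).fit_transform(X_scaled)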

Why PCA is reducing the dimension of a data set?

PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and to project it onto a new subspace with equal or fewer dimensions than the original one.


1 Answer

The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
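
You can check this correspondence yourself; a sketch, assuming a small random data matrix X:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

pca = PCA(n_components=2).fit(X)

# eigen-decomposition of the covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending eigenvalues
top2 = eigvecs[:, ::-1][:, :2].T                            # two largest, as rows

# each PCA component equals a top eigenvector, up to sign
print(np.allclose(np.abs(pca.components_), np.abs(top2)))   # True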

The pca_components parameter tells the algorithm how many of these basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance in your data.
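
In scikit-learn you can inspect how much variance the chosen vectors actually explain (a sketch, assuming the fitted pca object from the question):

print(pca.explained_variance_ratio_)        # fraction of variance per component
print(pca.explained_variance_ratio_.sum())  # total variance retained by all 100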

The transform function transforms (seriously? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the first 100 best vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
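
That projection is just a matrix product with the chosen basis vectors; a sketch, assuming the fitted pca from the question and a 2-D train_data (this matches how scikit-learn's PCA.transform behaves with whiten=False):

# center the data, then project it onto the component vectors
X_projected = (train_data - pca.mean_) @ pca.components_.T
# equal to pca.transform(train_data), up to floating-point error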

For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, then again, the 3D point cloud would first be rotated so that the directions of greatest variance become parallel to the coordinate axes. Then the axis along which the variance is smallest is discarded, leaving you with 2D data.

So, to answer your question directly: yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
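
Concretely, for the Kaggle digit-recognizer training set (42,000 images of 28x28 = 784 pixels; these numbers are assumed from the competition data):

print(np.asarray(train_data).shape)  # (42000, 784) - one row per image, one column per pixel
print(X_train.shape)                 # (42000, 100) - same rows, pca_components columns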

answered Oct 21 '22 by BartoszKP