I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing the labels from the training data, I append each CSV row to a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
import numpy as np
from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)

    # fit PCA on the training data, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)
I then create a kNN classifier, fit it with the X_train data, and make predictions on the X_test data.
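Roughly like this (a minimal sketch; train_labels stands for the labels stripped from the training rows earlier, and the classifier parameters are left at their defaults):

from sklearn.neighbors import KNeighborsClassifier

# reduce both sets to pca_components dimensions
X_train, X_test = preprocess(train_data, test_data)

# fit kNN on the reduced training data and predict the reduced test data
knn = KNeighborsClassifier()
knn.fit(X_train, train_labels)
predictions = knn.predict(X_test)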
Using this method I can get around 97% accuracy.
My question is about the dimensionality of the data before and after PCA is performed.
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?
PCA is most useful on datasets with three or more dimensions, because as the number of dimensions grows it becomes increasingly difficult to interpret the resulting cloud of data directly. PCA is applied to datasets with numeric variables.
Dimensionality is the number of features or variables present in a dataset; more simply, it is the number of columns. Correlation signifies how strongly two variables are related to each other.
PCA is affected by scale, so you should scale the features in your data before applying it. Use StandardScaler from scikit-learn to standardize the features onto unit scale (mean = 0, standard deviation = 1), which is a requirement for the optimal performance of many machine learning algorithms.
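A minimal sketch of that step (variable names follow the question's code):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(train_data)
train_scaled = scaler.transform(train_data)
test_scaled = scaler.transform(test_data)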
PCA helps us identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and project it onto a new subspace with equal or fewer dimensions than the original one.
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
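You can see this directly with numpy on a small toy dataset (a sketch, not the digit data):

import numpy as np

# toy data: 200 samples, 3 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# eigendecomposition of the covariance matrix (np.cov centers the data)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh returns ascending order; reverse so the largest-variance
# direction (the first principal component) comes first
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])      # variance along each principal direction
print(eigenvectors[:, order])  # columns are the orthogonal basis vectors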
The pca_components parameter tells the algorithm how many of those basis vectors you are interested in. So if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
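After fitting, you can check how much of the total variance the chosen components actually retain (using the pca object from the preprocess function above):

# fraction of total variance explained by each component,
# and the fraction the 100 components retain together
print(pca.explained_variance_ratio_[:5])
print(pca.explained_variance_ratio_.sum())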
The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
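In scikit-learn that projection is explicit: with whiten=False (the default), transform subtracts the fitted mean and takes dot products with the chosen eigenvectors:

# rows of pca.components_ are the chosen eigenvectors;
# transform(X) is centering followed by a projection onto them
manual = (np.asarray(train_data) - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(train_data)))  # True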
For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, the 3D point cloud would first be rotated so that the most variance lies parallel to the coordinate axes. Then the axis where the variance is smallest is discarded, leaving you with 2D data.
So, to answer your question directly: yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
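For the digit-recognizer data specifically (assuming all 42000 training rows with their 784 pixel columns were loaded), the shapes work out like this:

print(np.shape(train_data))  # (42000, 784) - 784 input dimensions
print(X_train.shape)         # (42000, 100) - 100 dimensions after PCA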