I want to perform Principal Component Analysis on a dataset and then feed the principal components to a LogisticRegression classifier.
Specifically, I want to apply PCA keeping 90% of the total variance, using the function computePrincipalComponentsAndExplainedVariance.
Here's the code for reading the dataset:
import org.apache.spark.mllib.linalg.Vectors

// Load the data
val text = sparkSession.sparkContext.textFile("dataset.data")
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (last column) and the 57 feature columns
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))
I am not quite sure how to perform PCA so that 90% of the total variance is retained.
The function computePrincipalComponentsAndExplainedVariance returns a matrix of principal components together with a vector indicating how much variance each component explains. From the documentation:
Returns: a matrix of size n-by-k, whose columns are principal components, and a vector of values which indicate how much variance each principal component explains
Pass in a k that is large enough (up to the number of features), sum the entries of the vector until the running total reaches 90%, and then keep that many leading columns of the matrix.
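Here is a minimal sketch of that approach, assuming the dataLP pair RDD from the question; the choice of 57 components and the 0.9 threshold match the question, but all other variable names are illustrative:

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

// Build a RowMatrix from the feature vectors, keeping the labels aside.
val features = dataLP.map(_._2)
val rowMatrix = new RowMatrix(features)

// Request more components than you expect to need; here, all 57.
val (pc, explainedVariance) =
  rowMatrix.computePrincipalComponentsAndExplainedVariance(57)

// Smallest k whose cumulative explained variance reaches 90%.
val k = explainedVariance.toArray
  .scanLeft(0.0)(_ + _)   // running totals: 0.0, v0, v0+v1, ...
  .indexWhere(_ >= 0.9)   // first prefix length whose total is >= 0.9

// Keep only the first k columns of the (column-major) component matrix.
val topK = Matrices.dense(pc.numRows, k, pc.toArray.slice(0, pc.numRows * k))

// Project the data onto the top-k principal components.
val projected = rowMatrix.multiply(topK)

// Re-attach labels (both RDDs come from dataLP via row-preserving maps,
// so zip lines them up) and build LabeledPoints for LogisticRegression.
val projectedLP = dataLP.map(_._1).zip(projected.rows)
  .map { case (label, v) => LabeledPoint(label, v) }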