I want to perform Principal Component Analysis on a dataset and then feed the principal components to a LogisticRegression classifier.
Specifically, I want to apply PCA keeping 90% of the total variance, using the function computePrincipalComponentsAndExplainedVariance.
Here's the code for reading the dataset:
import org.apache.spark.mllib.linalg.Vectors

// Load the data
val text = sparkSession.sparkContext.textFile("dataset.data")
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (last column) and the 57 feature columns
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))
I am not quite sure how to perform PCA so that 90% of the total variance is retained.
The function computePrincipalComponentsAndExplainedVariance returns a matrix of principal components together with a vector indicating how much variance each component explains. From the documentation:
Returns: a matrix of size n-by-k, whose columns are principal components, and a vector of values which indicate how much variance each principal component explains
Pass in a k that is large enough (up to the number of features), sum the entries of the vector until the running total reaches 90%, and then keep that many leading columns of the matrix.
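Here is a minimal sketch of that approach, assuming the dataLP pair RDD from the question; the choice of 57 components and the 0.9 threshold match the question, but all other variable names are illustrative:

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

// Build a RowMatrix from the feature vectors, keeping the labels aside.
val features = dataLP.map(_._2)
val rowMatrix = new RowMatrix(features)

// Request more components than you expect to need; here, all 57.
val (pc, explainedVariance) =
  rowMatrix.computePrincipalComponentsAndExplainedVariance(57)

// Smallest k whose cumulative explained variance reaches 90%.
val k = explainedVariance.toArray
  .scanLeft(0.0)(_ + _)   // running totals: 0.0, v0, v0+v1, ...
  .indexWhere(_ >= 0.9)   // first prefix length whose total is >= 0.9

// Keep only the first k columns of the (column-major) component matrix.
val topK = Matrices.dense(pc.numRows, k, pc.toArray.slice(0, pc.numRows * k))

// Project the data onto the top-k principal components.
val projected = rowMatrix.multiply(topK)

// Re-attach labels (both RDDs come from dataLP via row-preserving maps,
// so zip lines them up) and build LabeledPoints for LogisticRegression.
val projectedLP = dataLP.map(_._1).zip(projected.rows)
  .map { case (label, v) => LabeledPoint(label, v) }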