PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (as someone new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility.

My question is very concrete and concerns PCA. The MLlib implementation seems to limit the number of columns. Its documentation says:

spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.

Also, the Java code example states:

The number of columns should be small, e.g, less than 1000.

On the other hand, the ML documentation mentions no such limitation.

So, my question is: does this limitation also exist in Spark ML? If so, why does it exist, and is there a workaround that lets me use this implementation even when the number of columns is large?

asked Oct 26 '16 12:10

Kobe-Wan Kenobi

1 Answer

PCA consists of finding a set of decorrelated random variables that can represent your data, sorted in decreasing order of the amount of variance they retain.

These variables are found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.

When X is large, say of dimensions n x d, you can compute X^T X by taking the outer product of each row of the matrix with itself and adding all the results up. This is amenable to a simple map-reduce procedure if d is small, no matter how large n is. That's because the outer product of each row with itself is a d x d matrix, which each worker has to hold and manipulate in main memory. That's why you might run into trouble when handling many columns.
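The accumulation described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's actual code: the names `outer`, `add`, `gram_partition` and `gram` are made up for this example, the "partitions" are ordinary lists standing in for distributed chunks of rows, and the toy data is not mean-centered.

```python
from functools import reduce

def outer(row):
    """Outer product of one row with itself: a d x d matrix."""
    return [[a * b for b in row] for a in row]

def add(m1, m2):
    """Element-wise sum of two d x d matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def gram_partition(rows, d):
    """'Map' side: one worker sums the outer products of its own rows.
    Note that it must hold a full d x d accumulator in memory."""
    acc = [[0.0] * d for _ in range(d)]
    for row in rows:
        acc = add(acc, outer(row))
    return acc

def gram(partitions, d):
    """'Reduce' side: add the partial d x d matrices together to get X^T X."""
    partials = (gram_partition(p, d) for p in partitions)
    return reduce(add, partials)

# Hypothetical toy data: 3 rows, 2 columns, split over 2 "workers".
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
G = gram(partitions, d=2)
print(G)
```

The number of rows only affects how many outer products get summed; the size of every intermediate result depends on d alone, which is where the column limit comes from.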

If the number of columns is large (and the number of rows is not), you can still compute PCA. Compute the eigendecomposition of the small n x n matrix X X^T, where X is your (mean-centered) data matrix, then multiply X^T by the resulting eigenvectors and by the inverse of the diagonal matrix of singular values, i.e. the square roots of the eigenvalues. There's your orthogonal subspace.
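For readers who want to see why this works, here is a short derivation. The symbols u_i, v_i, lambda_i, U, V and Lambda are introduced here for illustration and are not part of the original answer; only eigenvalues that are strictly positive are considered.

```latex
% X is the mean-centered n x d data matrix, with n small and d large.
% Let (\lambda_i, u_i) be an eigenpair of the small n x n matrix X X^T:
\[
  X X^{T} u_i = \lambda_i u_i, \qquad u_i \in \mathbb{R}^{n},\ \|u_i\| = 1,\ \lambda_i > 0 .
\]
% Multiplying on the left by X^T shows that X^T u_i is an eigenvector
% of the large d x d matrix X^T X with the same eigenvalue:
\[
  (X^{T} X)\,(X^{T} u_i) = \lambda_i\,(X^{T} u_i).
\]
% Its squared norm is
\[
  \|X^{T} u_i\|^{2} = u_i^{T} X X^{T} u_i = \lambda_i ,
\]
% so the unit-length principal direction is
\[
  v_i = \frac{1}{\sqrt{\lambda_i}}\, X^{T} u_i ,
  \qquad\text{or in matrix form}\qquad
  V = X^{T} U \Lambda^{-1/2}.
\]
```

The d x d matrix X^T X is never formed; only the n x n matrix X X^T has to fit in memory.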

Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If it checks the dimensions of the input dataset to decide whether to take the second approach, then you won't have problems with a large number of columns as long as the number of rows is small.

Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling by themselves rather than state a limitation that may not apply to everyone. That might be why they decided not to mention the limitation in the new docs.

Update: The source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535 columns, and at 10,000 columns they issue a warning.
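To put those thresholds in perspective, here is a rough back-of-the-envelope calculation of how much memory a single dense d x d matrix needs. It assumes 8-byte double-precision values and ignores any overhead; `dense_matrix_bytes` is a helper made up for this illustration, not a Spark function.

```python
def dense_matrix_bytes(d, bytes_per_value=8):
    """Approximate memory for a dense d x d matrix of doubles (assumed 8 bytes each)."""
    return d * d * bytes_per_value

# The column counts mentioned in the docs and in the update above.
for d in (1_000, 10_000, 65_535):
    print(f"d = {d:>6}: {dense_matrix_bytes(d) / 1e9:.3f} GB")
```

Under these assumptions, 1,000 columns cost about 8 MB per matrix, 10,000 columns about 800 MB, and 65,535 columns roughly 34 GB, which is consistent with the answer's point that the practical ceiling depends on worker memory.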

answered Oct 20 '22 02:10

cangrejo

Tags: apache-spark, apache-spark-ml, apache-spark-mllib
