 

PCA output in Spark doesn't match scikit-learn

I am trying out PCA (principal component analysis) in Spark ML.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]

df = spark.createDataFrame(data, ["features"])
pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()

Output:

+---------+--------------------+
| features|         pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+

When I tried PCA on the same data in scikit-learn, as below, it gave a different result:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)

Output:

[ 1.  1.] [-2.44120041]
[ 1.  2.] [-1.85996222]
[ 4.  4.] [ 1.74371458]
[ 5.  4.] [ 2.55744805]

As you can see, there is a difference in the output.

To verify the result, I calculated the PCA for the same data mathematically and got the same result as scikit-learn. Below is the PCA transformation calculation for the first data point (1.0, 1.0).
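Taking the mean vector MX = (2.75, 2.75) and the first principal component A = (0.814, 0.581) (the same rounded values used further below):

Y = 0.814*(1.0 - 2.75) + 0.581*(1.0 - 2.75) = -2.441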

As you can see, this matches the scikit-learn result.

It seems Spark ML doesn't subtract the mean vector MX from the data vector X, i.e. it uses Y = A*X in place of Y = A*(X - MX).

For point (1.0, 1.0):

Y = (0.814*1.0) + (0.581*1.0) = 1.395

which is the same result (up to sign) that we got with Spark ML.
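A quick NumPy check of this, as a sketch using the rounded component values from above (so the numbers are approximate):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
A = np.array([0.814, 0.581])  # first principal component, rounded
MX = X.mean(axis=0)           # mean vector, [2.75, 2.75]

print(X @ A)         # ~[1.395, 1.976, 5.580, 6.394] -> Spark's output (up to sign)
print((X - MX) @ A)  # ~[-2.441, -1.860, 1.744, 2.558] -> scikit-learn's output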

Is Spark ML giving the wrong result, or am I missing something?

asked Dec 12 '17 by Deepak Kumar

1 Answer

In Spark, the PCA transformation does not center (mean-subtract) the input data for you. You need to take care of that yourself before applying the method. To remove the mean from the data, StandardScaler can be used in the following way:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)

The PCA method can then be applied to scaled_df in the same way as before, and the results will match what scikit-learn gives (possibly up to the sign of the component, which is arbitrary in PCA).
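For example, a minimal sketch reusing the column names defined above:

pca = PCA(k=1, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(scaled_df)
model.transform(scaled_df).show()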


I would also recommend making use of a Spark ML Pipeline to simplify the process. Using the standardization and PCA together could look like this:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])

model = pipeline.fit(df)
transformed_feature = model.transform(df)
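If you want to inspect the fitted components afterwards, the PCA stage can be pulled out of the fitted pipeline model (a sketch; the stage order follows the pipeline definition above):

pca_model = model.stages[-1]        # the fitted PCAModel, last stage in the pipeline
print(pca_model.pc)                 # principal components matrix
print(pca_model.explainedVariance)  # variance explained by each component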
answered Nov 14 '22 by Shaido