 

PCA output in Spark doesn't match scikit-learn

I am trying out PCA (principal component analysis) in Spark ML.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]

df = spark.createDataFrame(data, ["features"])
pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()

Output:

+---------+--------------------+
| features|         pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+

When I tried PCA on the same data in scikit-learn, as below, it gave a different result:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)

Output:

[ 1.  1.] [-2.44120041]
[ 1.  2.] [-1.85996222]
[ 4.  4.] [ 1.74371458]
[ 5.  4.] [ 2.55744805]

As you can see, there is a difference in the output.

To verify the result, I calculated the PCA for the same data mathematically and got the same result as scikit-learn. Below is the PCA transformation calculation for the first data point (1.0, 1.0).
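Taking the mean vector MX = (2.75, 2.75) and the first principal component A = (0.814, 0.581) (the same rounded values used further below):

Y = 0.814*(1.0 - 2.75) + 0.581*(1.0 - 2.75) = -2.441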

As you can see, this matches the scikit-learn result.

It seems Spark ML doesn't subtract the mean vector MX from the data vector X, i.e. it uses Y = A*X in place of Y = A*(X - MX).

For point (1.0, 1.0):

Y = (0.814*1.0) + (0.581*1.0) = 1.395

which is the same result (up to sign) that we got with Spark ML.
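A quick NumPy check of this, as a sketch using the rounded component values from above (so the numbers are approximate):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
A = np.array([0.814, 0.581])  # first principal component, rounded
MX = X.mean(axis=0)           # mean vector, [2.75, 2.75]

print(X @ A)         # ~[1.395, 1.976, 5.580, 6.394] -> Spark's output (up to sign)
print((X - MX) @ A)  # ~[-2.441, -1.860, 1.744, 2.558] -> scikit-learn's output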

Is Spark ML giving the wrong result, or am I missing something?

asked Dec 12 '17 by Deepak Kumar

1 Answer

In Spark, the PCA transformation does not center (mean-subtract) the input data for you. You need to take care of that yourself before applying the method. To remove the mean from the data, StandardScaler can be used in the following way:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)

The PCA method can then be applied to scaled_df in the same way as before, and the results will match what scikit-learn gives (possibly up to the sign of the component, which is arbitrary in PCA).
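For example, a minimal sketch reusing the column names defined above:

pca = PCA(k=1, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(scaled_df)
model.transform(scaled_df).show()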


I would also recommend making use of a Spark ML Pipeline to simplify the process. Using the standardization and PCA together could look like this:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])

model = pipeline.fit(df)
transformed_feature = model.transform(df)
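If you want to inspect the fitted components afterwards, the PCA stage can be pulled out of the fitted pipeline model (a sketch; the stage order follows the pipeline definition above):

pca_model = model.stages[-1]        # the fitted PCAModel, last stage in the pipeline
print(pca_model.pc)                 # principal components matrix
print(pca_model.explainedVariance)  # variance explained by each component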
answered Nov 14 '22 by Shaido