
Why does StandardScaler not give non-zero values for all dimensions when the variance is not zero?

I have a DataFrame that looks as follows:

+-----+--------------------+
|  uid|            features|
+-----+--------------------+
|user1|       (7,[1],[5.0])|
|user2|(7,[0,2],[13.0,4.0])|
|user3|(7,[2,3],[7.0,45.0])|
+-----+--------------------+

The features column contains sparse vectors of size 7.

I am applying a StandardScaler as follows:

import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

val scalerModel = scaler.fit(df)

// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(df)

The output DataFrame looks as follows:

+-----+--------------------+--------------------+
|  uid|            features|      scaledFeatures|
+-----+--------------------+--------------------+
|user1|       (7,[1],[5.0])|(7,[1],[1.7320508...|
|user2|(7,[0,2],[13.0,4.0])|(7,[0,2],[1.73205...|
|user3|(7,[2,3],[7.0,45.0])|(7,[2,3],[1.99323...|
+-----+--------------------+--------------------+

As we can see, the scaledFeatures of user1, for example, contain only one non-zero element (the others are zeros). I was expecting every scaledFeatures vector to contain non-zero values in all dimensions whose variance is not zero.

Let's take for example the third dimension, i.e. the index 2 of each feature vector:

  • This dimension has a value of 0.0 for user1, 4.0 for user2 and 7.0 for user3.
  • The mean of these values is: (0+4+7)/3 = 3.667
  • The SD is: sqrt[ ( (0-3.667)^2 + (4-3.667)^2 + (7-3.667)^2 ) /3] = 2.867
  • The value scaled to unit standard deviation for user1 should therefore be: (value-average)/SD = (0-3.667)/2.867 = -1.279

The question is: why does user1 have a zero value for this dimension in the output DataFrame?

Rami asked Nov 30 '25 02:11


1 Answer

Here is the culprit:

.setWithMean(false) 

Since the only thing you apply is scaling to unit standard deviation, the result is exactly as it should be:

xs1 <- c(5, 0, 0)
xs1 / sd(xs1)
## [1] 1.732051 0.000000 0.000000
sd(xs1 / sd(xs1))
## [1] 1

xs2 <- c(0.0, 4.0, 7.0)
xs2 / sd(xs2)
## [1] 0.000000 1.138990 1.993232
sd(xs2 / sd(xs2))
## [1] 1

Also, withMean requires dense data. From the docs:

withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so this does not work on sparse input and will raise an exception.

Merged from the comments:

So without setWithMean(true) the scaler does not subtract the mean from each value; it simply divides the value by the standard deviation.
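The difference between the two settings can be reproduced in plain Scala, without Spark. This is only a sketch of the arithmetic: the object and method names are made up for illustration, and it assumes the sample standard deviation (dividing by n - 1), which is what R's sd() computes in the snippet above and what matches the scaledFeatures values shown in the question.

```scala
// Illustrative sketch of the scaling arithmetic; not Spark's implementation.
object ScalingSketch {
  // Sample standard deviation (divides by n - 1), like R's sd().
  def sampleSd(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1))
  }

  // withStd = true, withMean = false: divide by the SD only, so zeros stay zero.
  def scaleOnly(xs: Seq[Double]): Seq[Double] = {
    val sd = sampleSd(xs)
    xs.map(_ / sd)
  }

  // withStd = true, withMean = true: subtract the mean first, then divide by the SD.
  def centerAndScale(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val sd = sampleSd(xs)
    xs.map(x => (x - mean) / sd)
  }

  def main(args: Array[String]): Unit = {
    // Values at index 2 for user1, user2 and user3.
    val index2 = Seq(0.0, 4.0, 7.0)
    println(scaleOnly(index2))      // user1 stays at 0.0
    println(centerAndScale(index2)) // user1 becomes negative
  }
}
```

With scaleOnly, a zero entry can never become non-zero, which is why the sparse vectors keep their sparsity pattern. The sketch does not guard against a zero standard deviation.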

To use .setWithMean(true), I had to convert the features to dense vectors instead of sparse ones, because it throws an exception for sparse vectors.
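One way such a conversion might look is sketched below. It is a fragment meant for spark-shell and rests on several assumptions: df is the DataFrame from the question, the features column holds org.apache.spark.ml.linalg vectors (Spark 2.x or later), and the names toDense, denseFeatures and centeredFeatures are invented here for illustration.

```scala
// Fragment for spark-shell; assumes `df` is the DataFrame from the question.
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// UDF that turns any ML vector (sparse or dense) into a dense one.
val toDense = udf((v: Vector) => (v.toDense: Vector))

// "denseFeatures" is an illustrative column name, not part of the original code.
val denseDf = df.withColumn("denseFeatures", toDense(col("features")))

// Same scaler as in the question, but with centering enabled.
val centeringScaler = new StandardScaler()
  .setInputCol("denseFeatures")
  .setOutputCol("centeredFeatures")
  .setWithStd(true)
  .setWithMean(true)

val centered = centeringScaler.fit(denseDf).transform(denseDf)
```

Keep in mind that dense vectors store every zero explicitly, so this can use far more memory when the vectors are wide and mostly empty.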

zero323 answered Dec 06 '25 06:12


