Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:
{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}
I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
The resulting set is:
[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552]
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552]
[0.0,0.0,0.0,0.0,0.0,0.0,1.0]
Note that all the original values of 70000.0 result in different 'normalized' values. Also, how, for example, was 1.357142668768307E-5 calculated when the values are 0.95, 1, -1, -0.95, 0? What's more, if I remove a feature, the results are different. I could not find any documentation on the issue.
In fact, my question is: how do I normalize all vectors in an RDD correctly?
Your expectations are simply incorrect. As clearly stated in the official documentation, "Normalizer scales individual samples to have unit L^p norm", where the default value for p is 2. Ignoring numerical precision issues:
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(Seq(
Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),
Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),
Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))
val normalizer = new Normalizer() // defaults to p = 2, i.e. the L^2 norm
val transformed = normalizer.transform(rdd)
// each vector now has unit L^2 norm; the component sums below are only
// close to 1.0 because the normalized 70000.0 entry dominates each row
transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
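That also explains the numbers in your output: 1.357142668768307E-5 is simply 0.95 divided by the L^2 norm of the entire first row, and since every row has a slightly different norm, the identical 70000.0 entries map to different "normalized" values. A quick check in plain Scala (no Spark required):

val v = Array(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)
// L^2 norm = sqrt(0.95^2 + 0.018^2 + 0.0^2 + 24.0^2 + 24.0^2 + 14.4^2 + 70000.0^2)
val norm = math.sqrt(v.map(x => x * x).sum) // ≈ 70000.0097
println(v(0) / norm) // ≈ 1.357142668768307E-5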
MLlib's Normalizer doesn't provide the functionality you need, but you can use StandardScaler from ML:
import org.apache.spark.ml.feature.StandardScaler
// .toDF requires the SQL implicits; the spark-shell imports them automatically

val df = rdd.map(Tuple1(_)).toDF("features")
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(true)
val transformedDF = scaler.fit(df).transform(df)
transformedDF.select($"scaledFeatures")show(5, false)
// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0] |
// +--------------------------------------------------------------------------------------------------------------------------+
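With withMean = true and withStd = true this is exactly the (v - mean) / stdev scaling you asked about; the constant 70000.0 column has zero standard deviation, so it maps to 0.0. If you'd rather stay on the RDD API, the older org.apache.spark.mllib.feature.StandardScaler works directly on RDD[Vector]; a minimal sketch, reusing the rdd from above:

import org.apache.spark.mllib.feature.{StandardScaler => MLlibStandardScaler}

// fit() computes the per-column mean and standard deviation,
// transform() then applies (v - mean) / stdev to every vector
val model = new MLlibStandardScaler(withMean = true, withStd = true).fit(rdd)
val scaled = model.transform(rdd)
scaled.collect.foreach(println)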