Feature normalization algorithm in Spark

Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},  
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},  
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},  
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},  
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},  

I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
The resulting set is:

[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552]  
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]  
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]  
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552]  
[0.0,0.0,0.0,0.0,0.0,0.0,1.0]  

Note that all of the original values of 70000.0 result in different 'normalized' values. Also, how was, for example, 1.357142668768307E-5 calculated when the values are 0.95, 1, -1, -0.95, 0? What's more, if I remove a feature, the results are different. I could not find any documentation on the issue.
In fact, my question is: how do I normalize all vectors in an RDD correctly?

Asked Dec 12 '15 by Alex B

People also ask

What is normalization in Spark?

Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm.
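
A minimal sketch of that behaviour, using the RDD-based Normalizer that also appears in the answer below; the 3.0/4.0 vector is made up purely for illustration:

import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

// L2 (the default) normalization: each component is divided by the vector's
// Euclidean norm, so the result has unit length.
val normalizer = new Normalizer()                  // p = 2 by default
normalizer.transform(Vectors.dense(3.0, 4.0))      // => [0.6, 0.8]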

Why do we use VectorAssembler in PySpark?

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
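
A minimal sketch, assuming a spark-shell session where the toDF implicits are in scope; the "hp" and "weight" columns and their values are invented for illustration:

import org.apache.spark.ml.feature.VectorAssembler

// Combine the two numeric columns into a single vector column "features".
val carsDF = Seq((110.0, 2800.0), (150.0, 3200.0)).toDF("hp", "weight")

val assembler = new VectorAssembler()
  .setInputCols(Array("hp", "weight"))
  .setOutputCol("features")

assembler.transform(carsDF).show(false)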

What is feature engineering in Spark?

Apache Spark MLlib contains many utility functions for performing feature engineering at scale, including methods for encoding and transforming features. These methods can also be used to process features for other machine learning libraries.
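
As one illustrative example of such an encoding transformer (not taken from the question), StringIndexer maps a string column to numeric category indices; the "color" column below is made up:

import org.apache.spark.ml.feature.StringIndexer

// Encode the string column "color" as numeric indices in "colorIndex".
val colorsDF = Seq("red", "blue", "red", "green").toDF("color")

val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")

indexer.fit(colorsDF).transform(colorsDF).show()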

How do you do feature selection in PySpark?

Currently, Spark supports three univariate feature selectors: chi-squared, ANOVA F-test and F-value. Users can choose a selector by setting featureType and labelType, and Spark will pick the score function based on the specified featureType and labelType.
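
A minimal sketch of that API (available from Spark 3.1 onward); trainingDF and the column names are placeholders, not anything from the question:

import org.apache.spark.ml.feature.UnivariateFeatureSelector

// Continuous features with a categorical label => Spark picks the ANOVA F-test.
val selector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("numTopFeatures")
  .setSelectionThreshold(3)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

// selector.fit(trainingDF).transform(trainingDF) would keep the 3 best-scoring features.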


1 Answer

Your expectations are simply incorrect. As clearly stated in the official documentation, "Normalizer scales individual samples to have unit L^p norm", where the default value for p is 2. Ignoring numerical precision issues:

import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(Seq(
    Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),  
    Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),  
    Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),  
    Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),  
    Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))

val normalizer = new Normalizer()   // p = 2, i.e. the L2 norm, by default
val transformed = normalizer.transform(rdd)
transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
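
In other words, Normalizer divides each row by that row's own Euclidean length, which is why every row containing 70000.0 ends up with slightly different values, and why removing a feature changes the result. For instance, the 1.357142668768307E-5 from the question is just 0.95 divided by the L2 norm of the whole vector it belongs to:

// 0.95 divided by the Euclidean (L2) norm of its vector:
val v = Array(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)
val norm = math.sqrt(v.map(x => x * x).sum)   // ≈ 70000.0097
v.map(_ / norm)                               // first element ≈ 1.357142668768307E-5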

MLlib doesn't provide the functionality you need, but you can use StandardScaler from ML.

import org.apache.spark.ml.feature.StandardScaler

val df = rdd.map(Tuple1(_)).toDF("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val transformedDF = scaler.fit(df).transform(df)

transformedDF.select($"scaledFeatures").show(5, false)

// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures                                                                                                            |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]                |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]               |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0]                           |
// +--------------------------------------------------------------------------------------------------------------------------+
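
To verify this output by hand: column 0 of the input is (0.95, 1, -1, -0.95, 0), whose mean is 0 and whose sample standard deviation is sqrt(0.95125), so the first value scales to roughly 0.974, matching the first row above; the last column is 0.0 everywhere because 70000.0 is constant, so all centered values are zero. A quick sanity check:

// Hand check of the first scaled value (column 0, first row):
val col0 = Seq(0.95, 1.0, -1.0, -0.95, 0.0)
val mean = col0.sum / col0.size                                                    // 0.0
val stdev = math.sqrt(col0.map(x => math.pow(x - mean, 2)).sum / (col0.size - 1))  // ≈ 0.97532
(0.95 - mean) / stdev                                                              // ≈ 0.9740388301169303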
Answered by zero323