How to normalize or standardize data with multiple columns/variables in Spark using Scala?

I am new to Apache Spark and Scala. I have a data set like this, which I am reading from a CSV file and converting into an RDD using Scala.

+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
|      1 | 1234 |  199090 |
|      4 | 2553 |  198613 |
|      6 | 3232 |  199090 |
|      1 | 8823 |  498831 |
|      7 | 2902 |  890000 |
|      8 | 7991 |  081097 |
|      9 | 7391 |  432370 |
|     12 | 6138 |  864981 |
|      7 | 6812 |  749821 |
+--------+------+---------+

I want to calculate the z-score values, i.e. standardize the data. So I am calculating the z-score for each column and then trying to combine them so that I get a standard scale.

Here is my code for calculating the z-score of the first column:

// parse the first column as Double
val scores = sorted.map(_.split(",")(0).toDouble).cache()
val count = scores.count
val mean = scores.sum / count
// squared deviations from the mean
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / count)
// z-score of every value in the first column
val zscore = scores.map(x => math.round((x - mean) / stddev))

How do I calculate this for each column? Or is there another way to normalize or standardize the data?

My requirement is to assign a rank (or scale).

Thanks

asked Apr 20 '16 by Niranjanp
People also ask

How do I sort multiple columns in Spark?

In Spark, we can use the sort() function of the DataFrame to sort multiple columns. If you want ascending or descending order, use asc and desc on the Column.
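
For example, a minimal sketch (assuming df is a DataFrame with the columns from the question, loaded as in the second answer below):

import org.apache.spark.sql.functions.col

// Sort by "recent" ascending, then break ties by "Freq" descending
val sortedDf = df.sort(col("recent").asc, col("Freq").desc)
sortedDf.show()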

What is normalization in Spark?

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range.

Should I standardize or normalize my data?

Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks. Standardization assumes that your data has a Gaussian (bell curve) distribution.

How do you normalize variable data?

When we normalize a variable, we first shift the scale so that it starts at 0 and then compress it so that it ends at 1. We do so by first subtracting the minimum value and then dividing by the new maximum value (which is the old maximum minus the old minimum).
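
As a rough illustration of that formula in plain Scala (using the question's "recent" column as the example values):

// Min-max normalization: shift so the minimum becomes 0, then divide by the new maximum
val values = Seq(1.0, 4.0, 6.0, 1.0, 7.0, 8.0, 9.0, 12.0, 7.0) // the "recent" column
val minV = values.min
val maxV = values.max
val normalized = values.map(v => (v - minV) / (maxV - minV))
// every value in normalized now falls in [0, 1]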


2 Answers

If you want to standardize the columns, you can use the StandardScaler class from Spark MLlib. The data should be in the form of an RDD[Vector], where Vector comes from the MLlib linalg package. You can choose to standardize with the mean, the standard deviation, or both.

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Array(
    Array(1.0,2.0,3.0),
    Array(4.0,5.0,6.0),
    Array(7.0,8.0,9.0),
    Array(10.0,11.0,12.0)))

// Converting RDD[Array] to RDD[Vectors]
val features = data.map(a => Vectors.dense(a))
// Creating a Scaler model that standardizes with both mean and SD
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
// Scale features using the scaler model
val scaledFeatures = scaler.transform(features)

The scaledFeatures RDD contains the z-scores of all columns.

Hope this answer helps. Check the Documentation for more info.
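
If you want to check the scaler or compute the statistics yourself, MLlib's Statistics.colStats returns the column-wise mean and variance of an RDD[Vector]. A minimal sketch, reusing the features RDD from above:

import org.apache.spark.mllib.stat.Statistics

// Column-wise summary statistics over the RDD[Vector]
val summary = Statistics.colStats(features)
println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances (take the square root for the standard deviation)

Note that MLlib uses the corrected sample standard deviation (dividing by n - 1), so the result can differ slightly from the population formula in the question.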

answered Nov 15 '22 by ar7

You may want to use the code below to perform standard scaling on the required columns. VectorAssembler is used to select the columns that need to be transformed, and StandardScaler lets you choose whether to center with the mean and/or scale with the standard deviation.

Code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/hadoop/data/your_dataset.csv")
df.show()

// Assemble the columns to be scaled into a single vector column
val assembler = new VectorAssembler().setInputCols(Array("recent", "Freq", "Monitor")).setOutputCol("features")

val transformVector = assembler.transform(df)

// Scale with the standard deviation only; use setWithMean(true) to also center the data
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)

val scalerModel = scaler.fit(transformVector)
val scaledData = scalerModel.transform(transformVector)

scaledData.show(20, false)
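
If you want normalization to a [0, 1] range rather than z-scores, the same pipeline works with spark.ml's MinMaxScaler in place of StandardScaler. A minimal sketch, reusing transformVector from above:

import org.apache.spark.ml.feature.MinMaxScaler

// Rescale each assembled feature column to the [0, 1] range
val minMaxScaler = new MinMaxScaler().setInputCol("features").setOutputCol("normFeatures")
val normData = minMaxScaler.fit(transformVector).transform(transformVector)
normData.show(20, false)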
answered Nov 15 '22 by Akash Singh