
Outlier detection algorithm spark mllib

Are there any pre-built outlier detection algorithms or interquartile range (IQR) identification methods available in Spark 2.0.0? I found some code here, but I don't think it is available in Spark 2.0.0 yet.

Thanks

Sudhakar Chavan asked Oct 08 '16 07:10


1 Answer

If you can't find a pre-built method, you can do something like this:

Example: outlier detection using the box-and-whisker (1.5 * IQR) rule:

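import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumes an existing SparkSession named sparkSession.
val sampleData = List(10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
               14.7, 14.7, 14.9, 15.1, 15.9, 16.4)
val rowRDD = sparkSession.sparkContext.makeRDD(sampleData.map(value => Row(value)))
val schema = StructType(Array(StructField("value", DoubleType)))
val df = sparkSession.createDataFrame(rowRDD, schema)

// First and third quartiles; a relative error of 0.0 requests exact quantiles.
val quantiles = df.stat.approxQuantile("value", Array(0.25, 0.75), 0.0)
val Q1 = quantiles(0)
val Q3 = quantiles(1)
val IQR = Q3 - Q1
val lowerRange = Q1 - 1.5 * IQR
val upperRange = Q3 + 1.5 * IQR

// Keep only the rows that fall outside the 1.5 * IQR fences.
val outliers = df.filter(s"value < $lowerRange or value > $upperRange")
outliers.show()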
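
If you want to sanity-check the 1.5 * IQR rule without a Spark session, the same arithmetic can be sketched in plain Scala. This is only an illustration: `IqrCheck`, `nearestRank`, `fences` and `outliers` are hypothetical names made up here, and the nearest-rank quantile definition is an assumption that may pick slightly different quartiles than `approxQuantile` does.

```scala
object IqrCheck {
  // Hypothetical helper (not a Spark API): nearest-rank quantile
  // on a sequence that is already sorted in ascending order.
  def nearestRank(sorted: Vector[Double], p: Double): Double = {
    val rank = math.ceil(p * sorted.length).toInt // 1-based rank
    sorted(math.max(rank, 1) - 1)
  }

  // Lower and upper fences of the 1.5 * IQR rule.
  def fences(data: Seq[Double]): (Double, Double) = {
    val sorted = data.sorted.toVector
    val q1 = nearestRank(sorted, 0.25)
    val q3 = nearestRank(sorted, 0.75)
    val iqr = q3 - q1
    (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
  }

  // Values that fall strictly outside the fences.
  def outliers(data: Seq[Double]): Seq[Double] = {
    val (lower, upper) = fences(data)
    data.filter(v => v < lower || v > upper)
  }

  def main(args: Array[String]): Unit = {
    val sampleData = List(10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
      14.7, 14.7, 14.9, 15.1, 15.9, 16.4)
    println(outliers(sampleData)) // prints List(10.2, 15.9, 16.4)
  }
}
```

With this definition the quartiles of the sample are 14.4 and 14.9, so the fences land at roughly 13.65 and 15.65, and 10.2, 15.9 and 16.4 are flagged.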

solution source:

Outlier Detection using Quantiles

mjimcua answered Sep 30 '22 21:09