Are there any pre-built outlier detection algorithms or interquartile range (IQR) identification methods available in Spark 2.0.0? I found some code here, but I don't think it is available in Spark 2.0.0 yet.
Thanks
If you can't find a prebuilt method, you can do something like this:
Example: outlier detection using the box-and-whisker plot (IQR) rule:
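import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumes an existing SparkSession named sparkSession
val sampleData = List(10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
  14.7, 14.7, 14.9, 15.1, 15.9, 16.4)
val rowRDD = sparkSession.sparkContext.makeRDD(sampleData.map(value => Row(value)))
val schema = StructType(Array(StructField("value", DoubleType)))
val df = sparkSession.createDataFrame(rowRDD, schema)

// relativeError = 0.0 requests exact quantiles, which can be expensive on large data
val quantiles = df.stat.approxQuantile("value", Array(0.25, 0.75), 0.0)
val Q1 = quantiles(0)
val Q3 = quantiles(1)
val IQR = Q3 - Q1

// Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
val lowerRange = Q1 - 1.5 * IQR
val upperRange = Q3 + 1.5 * IQR
val outliers = df.filter(s"value < $lowerRange or value > $upperRange")
outliers.show()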
solution source:
Outlier Detection using Quantiles
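To check the fence arithmetic without a cluster, here is a minimal plain-Scala sketch of the same 1.5*IQR rule. The object `IqrFences` and its methods are hypothetical names, not part of Spark, and the quartiles are computed with a nearest-rank definition, which is an assumption: `approxQuantile` may pick slightly different values on other data.

```scala
object IqrFences {
  // Nearest-rank quantile on an already sorted array.
  // Assumed definition for illustration; not necessarily what Spark uses.
  def nearestRank(sorted: Array[Double], p: Double): Double = {
    require(sorted.nonEmpty, "need at least one value")
    require(p > 0.0 && p <= 1.0, "p must be in (0, 1]")
    val rank = math.ceil(p * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Returns (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR).
  def fences(values: Seq[Double], k: Double = 1.5): (Double, Double) = {
    val sorted = values.toArray
    java.util.Arrays.sort(sorted)
    val q1 = nearestRank(sorted, 0.25)
    val q3 = nearestRank(sorted, 0.75)
    val iqr = q3 - q1
    (q1 - k * iqr, q3 + k * iqr)
  }

  // Keeps the values that fall strictly outside the fences, in input order.
  def outliers(values: Seq[Double], k: Double = 1.5): Seq[Double] = {
    val (lower, upper) = fences(values, k)
    values.filter(v => v < lower || v > upper)
  }
}
```

With the sample data from the answer, nearest-rank gives Q1 = 14.4 and Q3 = 14.9, so the fences come out at roughly 13.65 and 15.65, and the values flagged are 10.2, 15.9 and 16.4.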