I have a problem with Spark in Scala: I want to compute the average per key from RDD data. I create an RDD like this:
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to average the values per key, like this:
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
and then get the result like this:
[(2,120),(3,204),(4,160)]
How can I do this in Scala from the RDD? I am using Spark version 1.6.
You can use aggregateByKey, which builds a (sum, count) pair per key in a single pass and avoids shuffling every value the way groupByKey would.
// accumulate (sum, count) per key, then divide to get the average
val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),            // seqOp: add one value to the running (sum, count)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))  // combOp: merge partial (sum, count) pairs
val avg = agg_rdd.mapValues { case (sum, count) => sum / count }
avg.collect  // Array((2,120), (3,204), (4,160))
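Note that mapValues here uses integer division, which is fine for this data because every per-key sum is evenly divisible by its count. If fractional averages are possible, a minimal variation (the avgDouble name is mine, not part of the original answer) is to cast to Double before dividing:

// same aggregation, but divide as Double so non-integer averages are not truncated
val avgDouble = agg_rdd.mapValues { case (sum, count) => sum.toDouble / count }
avgDouble.collect  // e.g. Array((2,120.0), (3,204.0), (4,160.0))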