 

How can I compute the average from a Spark RDD?

I have a problem in Spark with Scala: I want to compute the average from RDD data. I created an RDD like this:

[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]

I want to average the values per key, like this:

[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]

and then get this result:

[(2,120),(3,204),(4,160)]

How can I do this in Scala with an RDD? I'm using Spark 1.6.

asked Sep 12 '17 by lee

People also ask

How do you calculate average in Spark?

In PySpark (the Python API for Spark), the avg() function returns the average value of a particular column in a DataFrame.
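For consistency with the question's language, here is a minimal sketch of the equivalent aggregation with the Scala DataFrame API (assuming a SparkSession named spark; the columns key and value are made up for illustration):

import org.apache.spark.sql.functions.avg
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq((2, 110), (2, 130), (2, 120)).toDF("key", "value")
df.groupBy("key").agg(avg("value")).show()  // key 2 -> avg(value) = 120.0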

How do you count records in RDD?

Use the count() action: calling count() on an RDD triggers a job and returns the number of records as a Long.
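A minimal Scala sketch, assuming a SparkContext named sc:

val rdd = sc.parallelize(Seq((2, 110), (2, 130), (2, 120)))
rdd.count()  // triggers a job and returns 3 (a Long)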

Which RDD function returns min/max count mean?

colStats() returns an instance of MultivariateStatisticalSummary , which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
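A minimal Scala sketch of colStats() from the RDD-based MLlib API (the observation vectors here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))
val summary = Statistics.colStats(observations)
println(summary.mean)   // column-wise means: [2.0,20.0]
println(summary.max)    // column-wise maxima: [3.0,30.0]
println(summary.count)  // number of rows: 3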

Is count an action in Spark?

Yes. The count() action returns the number of elements in an RDD. For example, if an RDD holds the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() returns 8.
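The same example in Scala:

val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
rdd.count()  // returns 8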


1 Answer

You can use aggregateByKey to compute the sum and count per key in a single pass, then divide:

val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
// Build a (sum, count) pair per key: the first function adds a value within a
// partition, the second merges partial results across partitions.
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
// Divide sum by count (integer division, which matches the expected output here).
val sum = agg_rdd.mapValues(x => x._1 / x._2)
sum.collect  // Array((2,120), (3,204), (4,160))
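If you want a fractional average instead of the integer division above, a common alternative sketch is the mapValues/reduceByKey pattern:

val avg = rdd
  .mapValues(v => (v, 1))                             // pair each value with a count of 1
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // sum values and counts per key
  .mapValues { case (sum, count) => sum.toDouble / count }
avg.collect  // Array((2,120.0), (3,204.0), (4,160.0))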
answered Oct 12 '22 by alexgids