 

How can I compute the average from a Spark RDD?

I have a problem in Spark with Scala: I want to compute the average from RDD data. I created an RDD like this:

[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]

I want to average the values per key, like this:

[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]

and then get this result:

[(2,120),(3,204),(4,160)]

How can I do this in Scala with an RDD? I'm using Spark 1.6.

asked Sep 12 '17 by lee

People also ask

How do you calculate average in Spark?

In PySpark (the Python API for Spark), the avg() function returns the average value of a particular column in a DataFrame.
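For consistency with the question's language, here is a minimal sketch of the equivalent aggregation with the Scala DataFrame API (assuming a SparkSession named spark; the columns key and value are made up for illustration):

import org.apache.spark.sql.functions.avg
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq((2, 110), (2, 130), (2, 120)).toDF("key", "value")
df.groupBy("key").agg(avg("value")).show()  // key 2 -> avg(value) = 120.0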

How do you count records in RDD?

Use the count() action: calling count() on an RDD triggers a job and returns the number of records as a Long.
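A minimal Scala sketch, assuming a SparkContext named sc:

val rdd = sc.parallelize(Seq((2, 110), (2, 130), (2, 120)))
rdd.count()  // triggers a job and returns 3 (a Long)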

Which RDD function returns min/max count mean?

colStats() returns an instance of MultivariateStatisticalSummary , which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
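A minimal Scala sketch of colStats() from the RDD-based MLlib API (the observation vectors here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))
val summary = Statistics.colStats(observations)
println(summary.mean)   // column-wise means: [2.0,20.0]
println(summary.max)    // column-wise maxima: [3.0,30.0]
println(summary.count)  // number of rows: 3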

Is count an action in Spark?

Yes. The count() action returns the number of elements in an RDD. For example, if an RDD holds the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() returns 8.
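The same example in Scala:

val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
rdd.count()  // returns 8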


1 Answer

You can use aggregateByKey to compute the sum and count per key in a single pass, then divide:

val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
// Build a (sum, count) pair per key: the first function adds a value within a
// partition, the second merges partial results across partitions.
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
// Divide sum by count (integer division, which matches the expected output here).
val sum = agg_rdd.mapValues(x => x._1 / x._2)
sum.collect  // Array((2,120), (3,204), (4,160))
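If you want a fractional average instead of the integer division above, a common alternative sketch is the mapValues/reduceByKey pattern:

val avg = rdd
  .mapValues(v => (v, 1))                             // pair each value with a count of 1
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // sum values and counts per key
  .mapValues { case (sum, count) => sum.toDouble / count }
avg.collect  // Array((2,120.0), (3,204.0), (4,160.0))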
answered Oct 12 '22 by alexgids