
How to compute the mean with Apache Spark?

I have a list of Double values stored like this:

JavaRDD<Double> myDoubles

I would like to compute the mean of this list. According to the documentation:

All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.

On the same page, I see the following code:

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

From my understanding, this is equivalent (in terms of types) to

Double MSE = RDD<Double>.mean()

As a consequence, I tried to compute the mean of my JavaRDD like this:

myDoubles.rdd().mean()

However, it doesn't work and gives me the following error: The method mean() is undefined for the type RDD<Double>. I also didn't find any mention of this function in the RDD Scala documentation. Is this a misunderstanding on my part, or is something else going on?

merours asked Jul 11 '14 09:07



1 Answer

It's actually quite simple: mean() is defined on the JavaDoubleRDD class. I didn't find a way to cast a JavaRDD<Double> to a JavaDoubleRDD, but in my case it was not necessary.

Indeed, this line in Scala

val mean = valuesAndPreds.map{case(v, p) => (v - p)}.mean()

can be expressed in Java as

double mean = valuesAndPreds.mapToDouble(tuple -> tuple._1 - tuple._2).mean();
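For the JavaRDD<Double> from the question, the same mapToDouble call should give a JavaDoubleRDD without any cast. The following is only a minimal sketch: it assumes Java 8, spark-core on the classpath, and a local master. The class name MeanExample, the helper mean, and the sample values are made up for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MeanExample {

    // Hypothetical helper: turns a JavaRDD<Double> into a JavaDoubleRDD
    // and returns its mean.
    static double mean(JavaRDD<Double> values) {
        // mapToDouble unboxes each element and yields a JavaDoubleRDD,
        // which is the class that defines mean().
        JavaDoubleRDD doubles = values.mapToDouble(Double::doubleValue);
        return doubles.mean();
    }

    public static void main(String[] args) {
        // Assumption: run locally, using all available cores.
        SparkConf conf = new SparkConf()
                .setAppName("MeanExample")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Illustrative sample data standing in for myDoubles.
            JavaRDD<Double> myDoubles =
                    sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));
            System.out.println("mean = " + mean(myDoubles));
        }
    }
}
```

If the data is already a local list, JavaSparkContext.parallelizeDoubles builds a JavaDoubleRDD directly, which avoids the extra mapping step.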
merours answered Sep 26 '22 10:09