 

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call the describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling the describe function yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can display statsDF normally by calling statsDF.show():

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values with something like:

val temp = statsDF.where($"summary" === "stddev").collect()

I get a Task not serializable exception.

I also get the same exception when I call:

statsDF.where($"summary" === "stddev").show()

Does this mean we cannot filter DataFrames generated by the describe() function?

asked Feb 08 '16 by Rami

People also ask

What is the function of filter () in Spark?

In Spark, the Filter function returns a new dataset formed by selecting those elements of the source on which the function returns true. So, it retrieves only the elements that satisfy the given condition.
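For instance, a minimal sketch using Spark 1.6's Scala API (the people DataFrame and its columns here are hypothetical, used only to illustrate filter()):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-style setup (SparkSession only arrived in Spark 2.0)
val sc = new SparkContext(new SparkConf().setAppName("filter-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A hypothetical DataFrame used only to demonstrate filter()
val people = Seq(("Alice", 34), ("Bob", 17), ("Carol", 25)).toDF("name", "age")

// filter() keeps only the rows for which the predicate evaluates to true
people.filter($"age" >= 18).show()
```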

How would you describe a DataFrame in Spark?

A Spark DataFrame is an integrated data structure with an easy-to-use API for simplifying distributed big data processing. DataFrame is available for general-purpose programming languages such as Java, Python, and Scala.
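As a small sketch (reusing the sqlContext and implicits from the sketch above; the sales DataFrame is hypothetical), a DataFrame couples distributed rows with a schema, so it can be inspected and queried like a table:

```scala
// Hypothetical data to show schema inspection and SQL-style querying
val sales = Seq(("apples", 10, 2.5), ("pears", 4, 3.0)).toDF("item", "qty", "price")

sales.printSchema()               // column names and inferred types
sales.registerTempTable("sales")  // Spark 1.6 API; createOrReplaceTempView in 2.x+
sqlContext.sql("SELECT item, qty * price AS total FROM sales").show()
```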

What is the difference between filter and map in Spark?

map(func): Returns a new distributed dataset formed by passing each element of the source through a function func. filter(func): Returns a new dataset formed by selecting those elements of the source on which func returns true.
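A short sketch contrasting the two on an RDD (assuming sc is a SparkContext, as set up in the earlier sketch):

```scala
val nums = sc.parallelize(1 to 5)

// map: transforms every element, producing exactly one output per input
val squares = nums.map(n => n * n)        // 1, 4, 9, 16, 25

// filter: keeps only the elements for which the predicate is true
val evens = nums.filter(n => n % 2 == 0)  // 2, 4

println(squares.collect().mkString(", "))
println(evens.collect().mkString(", "))
```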


1 Answer

I tried this on a toy dataset I had containing some health/disease data:

import org.apache.spark.sql.Row

// Drop to the underlying RDD, pair each summary label with column 1's value, keep "stddev"
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect()
```
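Alternatively (a minimal sketch, assuming your numeric column is named count as in the question's output, and that you are on Spark 1.6+ where mean and stddev exist in org.apache.spark.sql.functions), you can skip describe() and compute the statistics directly with agg, which returns typed doubles instead of the strings describe() produces:

```scala
import org.apache.spark.sql.functions.{mean, stddev}

// Compute the statistics directly; "count" is the numeric column from the question
val row = myDataFrame.agg(mean("count"), stddev("count")).first()
val meanValue   = row.getDouble(0)
val stddevValue = row.getDouble(1)
```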
answered Feb 12 '23 by eliasah