The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show():
+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I get a Task not serializable exception.
I get the same exception when I call:
statsDF.where($"summary" === "stddev").show()
Is it not possible to filter DataFrames generated by the describe() function?
In Spark, filter returns a new dataset made up of the elements of the source for which the given function returns true, so it keeps only the elements that satisfy the condition.
A Spark DataFrame is an integrated data structure with an easy-to-use API for simplifying distributed big data processing. DataFrame is available for general-purpose programming languages such as Java, Python, and Scala.
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
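To make the two definitions concrete, here is a minimal sketch that uses an ordinary Scala Seq as a stand-in for a distributed dataset. The object and value names are made up for illustration. RDDs and Datasets expose map and filter with the same meaning, but nothing here touches Spark.

```scala
object MapFilterSketch {
  // Hypothetical source collection standing in for a distributed dataset.
  val source: Seq[Int] = Seq(1, 2, 3, 4, 5)

  // map: build a new collection by passing every element through a function.
  val doubled: Seq[Int] = source.map(_ * 2)

  // filter: keep only the elements for which the predicate returns true.
  val large: Seq[Int] = doubled.filter(_ > 4)

  def main(args: Array[String]): Unit = {
    println(doubled)
    println(large)
  }
}
```

Both calls return new collections and leave source unchanged, which matches the "return a new dataset" wording above.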
I tried this on a toy dataset I had, containing some health and disease data:
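import org.apache.spark.sql.Row
// Pair each summary label with its value, keep only the stddev row, and collect it.
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect()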
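Since describe() only ever produces a handful of rows, another option might be to collect the whole summary to the driver and look the values up there. The following is a minimal sketch under stated assumptions: the object name, the helper statsOf, the app name, and the toy data are all hypothetical, and it has not been checked against the setup that raised the exception in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DescribeStats {
  // Hypothetical helper: collect the summary rows and index them by label.
  // Column 0 is "summary"; column 1 is the first described column, as a string.
  def statsOf(df: DataFrame): Map[String, String] =
    df.describe().collect()
      .map(row => row.getString(0) -> row.getString(1))
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("describe-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for myDataFrame from the question.
    val df = Seq(1, 2, 3, 4).toDF("count")

    val stats = statsOf(df)
    // describe() reports every value as a string, so convert on the driver.
    val mean = stats.get("mean").map(_.toDouble)
    val stddev = stats.get("stddev").map(_.toDouble)
    println(s"mean=$mean stddev=$stddev")

    spark.stop()
  }
}
```

The lookup happens on a plain Scala Map after collect(), so no filter predicate has to be applied to the DataFrame itself.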