 

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call the describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling the describe function yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can display statsDF normally by calling statsDF.show():

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values with something like:

val temp = statsDF.where($"summary" === "stddev").collect()

I get a Task not serializable exception.

I also get the same exception when I call:

statsDF.where($"summary" === "stddev").show()

Does this mean we cannot filter DataFrames generated by the describe() function?

asked Feb 08 '16 by Rami

People also ask

What is the function of filter () in Spark?

In Spark, the Filter function returns a new dataset formed by selecting those elements of the source on which the function returns true. So, it retrieves only the elements that satisfy the given condition.
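For instance, a minimal sketch using Spark 1.6's Scala API (the people DataFrame and its columns here are hypothetical, used only to illustrate filter()):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-style setup (SparkSession only arrived in Spark 2.0)
val sc = new SparkContext(new SparkConf().setAppName("filter-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A hypothetical DataFrame used only to demonstrate filter()
val people = Seq(("Alice", 34), ("Bob", 17), ("Carol", 25)).toDF("name", "age")

// filter() keeps only the rows for which the predicate evaluates to true
people.filter($"age" >= 18).show()
```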

How would you describe a DataFrame in Spark?

A Spark DataFrame is an integrated data structure with an easy-to-use API for simplifying distributed big data processing. DataFrame is available for general-purpose programming languages such as Java, Python, and Scala.
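As a small sketch (reusing the sqlContext and implicits from the sketch above; the sales DataFrame is hypothetical), a DataFrame couples distributed rows with a schema, so it can be inspected and queried like a table:

```scala
// Hypothetical data to show schema inspection and SQL-style querying
val sales = Seq(("apples", 10, 2.5), ("pears", 4, 3.0)).toDF("item", "qty", "price")

sales.printSchema()               // column names and inferred types
sales.registerTempTable("sales")  // Spark 1.6 API; createOrReplaceTempView in 2.x+
sqlContext.sql("SELECT item, qty * price AS total FROM sales").show()
```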

What is the difference between filter and map in Spark?

map(func): Returns a new distributed dataset formed by passing each element of the source through a function func. filter(func): Returns a new dataset formed by selecting those elements of the source on which func returns true.
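A short sketch contrasting the two on an RDD (assuming sc is a SparkContext, as set up in the earlier sketch):

```scala
val nums = sc.parallelize(1 to 5)

// map: transforms every element, producing exactly one output per input
val squares = nums.map(n => n * n)        // 1, 4, 9, 16, 25

// filter: keeps only the elements for which the predicate is true
val evens = nums.filter(n => n % 2 == 0)  // 2, 4

println(squares.collect().mkString(", "))
println(evens.collect().mkString(", "))
```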


1 Answer

I tried this on a toy dataset I had containing some health/disease data:

import org.apache.spark.sql.Row

// Drop to the underlying RDD, pair each summary label with column 1's value, keep "stddev"
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect()
```
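Alternatively (a minimal sketch, assuming your numeric column is named count as in the question's output, and that you are on Spark 1.6+ where mean and stddev exist in org.apache.spark.sql.functions), you can skip describe() and compute the statistics directly with agg, which returns typed doubles instead of the strings describe() produces:

```scala
import org.apache.spark.sql.functions.{mean, stddev}

// Compute the statistics directly; "count" is the numeric column from the question
val row = myDataFrame.agg(mean("count"), stddev("count")).first()
val meanValue   = row.getDouble(0)
val stddevValue = row.getDouble(1)
```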
answered Feb 12 '23 by eliasah