I have three Arrays of string type containing the following information:

groupBy array: names of the columns I want to group my data by.
aggregate array: names of the columns I want to aggregate.
operations array: the aggregate operations I want to perform.
I am trying to use Spark DataFrames to achieve this. Spark DataFrames provide an agg() to which you can pass a Map[String, String] (of column name and the respective aggregate operation) as input; however, I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?
In PySpark, groupBy() is used to collect identical data into groups on the DataFrame and then apply aggregate functions to the grouped data, which is how multiple aggregations can be performed at once; column_name_group below stands for the column to be grouped.
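A minimal sketch of that pattern, assuming a DataFrame named dataframe; column_name_group and column_name are placeholder names, not columns from my data:

from pyspark.sql import functions as F

# group on the placeholder column and apply several aggregates at once
dataframe.groupBy("column_name_group").agg(
    F.sum("column_name"),
    F.avg("column_name"),
    F.max("column_name"),
).show()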
agg (Java-specific): compute aggregates by specifying a map from column name to aggregate method. The resulting DataFrame will also contain the grouping columns. The available aggregate methods are avg, max, min, sum, count.
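The PySpark analogue of that map is a dict, and it makes the limitation easy to see: the column name is the key, so only one operation per column can be expressed. A small sketch, assuming a DataFrame df with columns k and v:

# dict-based agg: one aggregate per column, because the column name is the key
df.groupBy("k").agg({"v": "min"}).show()
# a literal like {"v": "min", "v": "max"} collapses to {"v": "max"},
# so two different operations on the same column cannot be requested this way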
Scala:
You can, for example, map over a list of functions with a defined mapping
from name to function:
import org.apache.spark.sql.functions.{col, min, max, mean}
import org.apache.spark.sql.Column

val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")

val mapping: Map[String, Column => Column] = Map(
  "min" -> min, "max" -> max, "mean" -> mean)

val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")

val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))

df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show

// +---+------+------+------+
// |  k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// |  1|   3.0|   3.0|   3.0|
// |  2|  -5.0|  -5.0|  -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately, the parser used internally by SQLContext is not exposed publicly,
but you can always try to build plain SQL queries:
df.registerTempTable("df")

val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
  f => s"$f($c) AS ${c}_${f}")
).mkString(",")

sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
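With the example values above (groupBy = Seq("k"), aggregate = Seq("v"), operations = Seq("min", "max", "mean")) this should build a query along the lines of SELECT k, min(v) AS v_min,max(v) AS v_max,mean(v) AS v_mean FROM df GROUP BY k.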
Python:
from pyspark.sql.functions import mean, sum, max, col

df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]

exprs = [f(col(c)) for f in funs for c in aggregate]

df.groupby(groupBy).agg(*exprs)

# or equivalent
df.groupby(*groupBy).agg(*exprs)
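If you want readable column names instead of the generated avg(v), sum(v), max(v) headers, the expressions can be aliased; a small sketch using the same variables:

# optional: alias each aggregate as <column>_<function>, e.g. v_mean, v_sum, v_max
exprs = [f(col(c)).alias("{}_{}".format(c, f.__name__)) for f in funs for c in aggregate]
df.groupby(*groupBy).agg(*exprs).show()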