How to sum the values of one column of a dataframe in spark/scala

Tags:

scala

apache-spark

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use functions like these: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps")

the function sum cannot be resolved.

Am I using the function sum incorrectly? Do I need to use the function map first? And if so, how?

A simple example would be very helpful! I started writing Scala recently.

505

asked May 04 '16 15:05

Ectoras

3 Answers

You must first import the functions:

import org.apache.spark.sql.functions._

Then you can use them like this:

val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)

You can also cast the result if needed:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
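For completeness, here is a minimal self-contained sketch of the approach above. It assumes Spark 2.x or later with a local SparkSession, and the sample rows are made up to stand in for the CSV from the question. The object name SumStepsExample and the helper totalSteps are hypothetical. Because sum returns null when there are no rows, the sketch wraps it in coalesce so the result can be read safely as a Long.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, lit, sum}

object SumStepsExample {
  // Hypothetical helper: total of the integer "steps" column as a Long.
  // sum over an integer column produces a bigint (Long) column;
  // coalesce replaces the null that sum yields for an empty DataFrame.
  def totalSteps(df: DataFrame): Long =
    df.agg(coalesce(sum("steps"), lit(0L))).first.getLong(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SumStepsExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data in place of the CSV file (assumption).
    val df = Seq(
      ("2016-05-04 10:00", 10, 72),
      ("2016-05-04 11:00", 2, 75),
      ("2016-05-04 12:00", 3, 70),
      ("2016-05-04 13:00", 4, 68)
    ).toDF("timestamp", "steps", "heartrate")

    println(totalSteps(df)) // prints 19

    spark.stop()
  }
}
```

If the column is read from CSV as a string, it would need a cast to a numeric type before summing; that step is left out here.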

Edit:

For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first

Edit2:

For dynamically applying the aggregations, the following options are available:

Applying to all numeric columns at once:

df.groupBy().sum()

Applying to a list of numeric column names:

val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)

Applying to a list of numeric column names with aliases and/or casts:

val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
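The sums.head, sums.tail: _* idiom in the last snippet is needed because agg is declared as agg(expr: Column, exprs: Column*): it takes one required argument followed by varargs, so a list cannot be passed directly. The sketch below illustrates the same pattern in plain Scala, without Spark. The agg function here is a hypothetical stand-in that works on strings, not the real Spark method.

```scala
object VarargsSketch {
  // Hypothetical stand-in mirroring the shape of agg(expr: Column, exprs: Column*):
  // one required argument, then zero or more further arguments.
  def agg(first: String, rest: String*): String =
    (first +: rest).mkString(", ")

  // Builds one expression per column name and passes them as head + tail.
  // Note: head throws on an empty list, so cols must be non-empty.
  def sumExprs(cols: List[String]): String = {
    val sums = cols.map(c => s"sum($c) AS sum_$c")
    agg(sums.head, sums.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    println(sumExprs(List("col1", "col2")))
    // prints: sum(col1) AS sum_col1, sum(col2) AS sum_col2
  }
}
```

With the real API the same caveat applies: guard against an empty list of column names before calling sums.head.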

193

answered Oct 18 '22 19:10

Daniel de Paula

If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)

// res1: Int = 19
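One caveat with the RDD route: reduce throws an UnsupportedOperationException on an empty RDD. A slightly more defensive sketch, assuming the column is a non-null IntegerType column, uses getInt and fold with a zero value instead. The names RddSumSketch and sumIntColumn are hypothetical, and the SparkSession setup assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddSumSketch {
  // Sums an integer column through the DataFrame's underlying RDD.
  // fold(0) returns 0 for an empty DataFrame, where reduce would throw.
  // Assumes the column is IntegerType and contains no nulls.
  def sumIntColumn(df: DataFrame, column: String): Int =
    df.select(column).rdd.map(_.getInt(0)).fold(0)(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddSumSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(10, 2, 3, 4).toDF("steps")
    println(sumIntColumn(df, "steps")) // prints 19

    spark.stop()
  }
}
```

An Int accumulator can overflow on large datasets; mapping to Long before folding would avoid that.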

answered Oct 18 '22 19:10

Alberto Bonsanto

Simply apply the sum aggregation function to your column. In PySpark, group by nothing and sum the column (grouping by 'steps' itself would return one row per distinct step value rather than a single total):

df.groupBy().sum('steps').show()

See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

See also: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

answered Oct 18 '22 18:10

shankarj67
