How to sum the values of one column of a dataframe in spark/scala

I have a DataFrame that I read from a CSV file with many columns, like timestamp, steps, heart rate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps") 

the function sum cannot be resolved.

Am I using the function sum wrongly? Do I need to use the function map first? And if so, how?

A simple example would be very helpful! I started writing Scala recently.

asked May 04 '16 by Ectoras


People also ask

How do I sum two columns in a Spark DataFrame?

To calculate the sum of two or more columns in PySpark, use the + operator on the columns. Alternatively, compute the sum and add it to the DataFrame as a new column by using the + operation together with the select function.
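
The same row-wise idea in Scala (a minimal sketch; df and the column names "col1" and "col2" are hypothetical):

import org.apache.spark.sql.functions.col

// add a new column holding the row-wise sum of two existing columns
val withTotal = df.withColumn("total", col("col1") + col("col2"))
withTotal.show()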

How do I extract a column in Spark?

To convert a Spark DataFrame column to a list, first select() the column you want, then use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
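
A minimal Scala sketch of that recipe (assuming a hypothetical string column named "name"; collect() pulls everything to the driver, so it only suits small results):

// select one column, convert each Row to a String, collect to the driver
val names: Array[String] = df.select("name")
  .rdd
  .map(_.getString(0))
  .collect()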


3 Answers

You must first import the functions:

import org.apache.spark.sql.functions._

Then you can use them like this:

val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)

You can also cast the result if needed:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
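
For context, a self-contained sketch of the whole flow (assuming Spark 2.x with a SparkSession named spark; the sample data is made up to stand in for the CSV):

import org.apache.spark.sql.functions._
import spark.implicits._

// made-up data standing in for the CSV rows
val df = Seq(100, 250, 75).toDF("steps")

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
// sumSteps: Long = 425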

Edit:

For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
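
The result is a single Row; the value types depend on the inputs (integer columns sum to Long, floating-point columns to Double). A sketch with two hypothetical columns:

// hypothetical: "col1" holds integers, "col2" holds doubles
val row = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2")).first
val sumCol1 = row.getAs[Long]("sum_col1")
val sumCol2 = row.getAs[Double]("sum_col2")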

Edit2:

For dynamically applying the aggregations, the following options are available:

  • Applying to all numeric columns at once:
df.groupBy().sum()
  • Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
  • Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
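
The sums.head, sums.tail:_* split is needed because agg is declared as agg(expr: Column, exprs: Column*): it expects one Column followed by a varargs of further Columns, so a plain List must be split into its first element and the rest.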
answered by Daniel de Paula


If you want to sum all the values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)

//res1 Int = 19
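
As a side note, on Spark 2.x a typed Dataset gives the same reduction without the asInstanceOf cast; a minimal sketch, assuming spark.implicits._ is in scope:

// as[Int] turns the single-column DataFrame into a Dataset[Int]
val total: Int = df.select("steps").as[Int].reduce(_ + _)
// total: Int = 19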
answered by Alberto Bonsanto


Simply apply the aggregation function sum to your column (this answer uses PySpark):

df.groupBy().sum('steps').show()
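
The Scala equivalent, for consistency with the question (a groupBy with no keys aggregates over the whole DataFrame):

df.groupBy().sum("steps").show()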

Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

answered by shankarj67