 

Scala/Spark - Multiply an Integer with each value in a Dataframe Column

I have a sample dataframe:

df_that_I_have
+---------+---------+-------+
| country | members | some  |
+---------+---------+-------+
| India   | 50      | 1     |
+---------+---------+-------+
| Japan   | 20      | 3     |
+---------+---------+-------+
| India   | 20      | 1     |
+---------+---------+-------+
| Japan   | 10      | 3     |
+---------+---------+-------+

and I want a dataframe that looks like this:

df_that_I_want
+---------+---------+-------+
| country | members | some  |
+---------+---------+-------+
| India   | 70      | 10    | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan   | 30      | 30    | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+

The second dataframe has, per country, the sum of members and the sum of some multiplied by 5.

This is what I'm doing to achieve this:

val df_that_I_want = df_that_I_have
                        .select(df_that_I_have("country"),
                                df_that_I_have.groupBy("country").sum("members"),
                                5 * df_that_I_have.groupBy("country").sum("some")) //Problem here

But the compiler does not allow me to do this because apparently I can't multiply 5 with a column.

How can I multiply an Integer value with the sum of some for each country?
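To make the intended arithmetic concrete, here is an illustration in plain Scala collections (no Spark, hypothetical variable names) of the computation I want:

```scala
// Group rows by country, sum "members", and multiply the sum of "some" by 5.
val rows = Seq(("India", 50, 1), ("Japan", 20, 3), ("India", 20, 1), ("Japan", 10, 3))
val wanted = rows
  .groupBy(_._1)                                    // group by country
  .map { case (country, grp) =>
    (country, grp.map(_._2).sum, grp.map(_._3).sum * 5)
  }
  .toSeq
  .sortBy(_._1)
// wanted: Seq(("India", 70, 10), ("Japan", 30, 30))
```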

Amber asked Apr 18 '17 07:04



2 Answers

You can try the lit function.

scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]

scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]

scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]

scala> df_that_I_want.show

+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
|  India|     70|  10|
|  Japan|     30|  30|
+-------+-------+----+
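Note that the long generated column name referenced in the select above (`((sum(Some),mode=Complete,isDistinct=false) * 5)`) comes from an older Spark version and varies between releases. A sketch that avoids depending on generated names (assuming Spark 2.x or later) is to alias the aggregates directly inside agg:

```scala
import org.apache.spark.sql.functions.{sum, lit}

// Alias each aggregate at the point of definition, so no
// version-specific generated column names need to be referenced later.
val df_that_I_want = df_that_I_have
  .groupBy("country")
  .agg(
    sum("members").alias("members"),
    (sum("some") * lit(5)).alias("some")
  )
```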
Rajat Mishra answered Oct 03 '22 08:10


Please try this (note: selecting only "country" first would drop the "members" and "some" columns and make the aggregation fail, so group the full dataframe):

df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
pasha701 answered Oct 03 '22 10:10