
Apache Spark Dataframe Groupby agg() for multiple columns

I have a DataFrame with three columns: id, first name (fName), and last name (lName).

I want to group by id and collect the fName and lName columns as lists.

For example, I have a DataFrame like this:

+---+-------+--------+
|id |fName  |lName   |
+---+-------+--------+
|1  |Akash  |Sethi   |
|2  |Kunal  |Kapoor  |
|3  |Rishabh|Verma   |
|2  |Sonu   |Mehrotra|
+---+-------+--------+

and I want the output to look like this:

+---+-------------+------------------+
|id |fName        |lName             |
+---+-------------+------------------+
|1  |[Akash]      |[Sethi]           |
|2  |[Kunal, Sonu]|[Kapoor, Mehrotra]|
|3  |[Rishabh]    |[Verma]           |
+---+-------------+------------------+

Thanks in advance.

Akash Sethi asked Mar 17 '17 06:03




1 Answer

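You can aggregate multiple columns in a single agg() call by passing one collect_list() per column:

df.groupBy("id").agg(collect_list("fName").alias("fName"), collect_list("lName").alias("lName"))

This gives the expected result. collect_list must be imported from the functions module (org.apache.spark.sql.functions in Scala, pyspark.sql.functions in PySpark). Without the alias() calls, the output columns are named collect_list(fName) and collect_list(lName) instead of fName and lName.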
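
To make the result easier to check by hand, here is a minimal plain-Python sketch of what grouping by id and collecting both name columns computes on the question's sample rows. It does not use Spark at all; the helper name group_and_collect and the tuple layout are assumptions made only for this illustration. It also keeps the input order inside each list, which this single-process sketch can promise but which should not be assumed for a distributed aggregation.

```python
from collections import defaultdict

# Rows from the question's example DataFrame: (id, fName, lName)
rows = [
    (1, "Akash", "Sethi"),
    (2, "Kunal", "Kapoor"),
    (3, "Rishabh", "Verma"),
    (2, "Sonu", "Mehrotra"),
]


def group_and_collect(rows):
    """Hypothetical helper: group rows by id and collect fName and lName
    into two parallel lists, one pair of lists per id."""
    grouped = defaultdict(lambda: {"fName": [], "lName": []})
    for row_id, f_name, l_name in rows:
        grouped[row_id]["fName"].append(f_name)
        grouped[row_id]["lName"].append(l_name)
    # Sort by id only so the printed output is stable and easy to compare.
    return {k: (v["fName"], v["lName"]) for k, v in sorted(grouped.items())}


result = group_and_collect(rows)
for row_id, (f_names, l_names) in result.items():
    print(row_id, f_names, l_names)
```

Each id ends up with one list of first names and one list of last names, and the two lists line up position by position because both are filled from the same row in the same pass.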

himanshuIIITian answered Oct 16 '22 15:10
