Merge multiple columns into one column in pyspark dataframe using python

I need to merge multiple columns of a dataframe into a single column whose value is a list (or tuple), using PySpark in Python.

Input dataframe:

+-------+-------+-------+-------+-------+
| name  |mark1  |mark2  |mark3  | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    |  "C"  |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    |  "A"  |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    |  "B"  |
+-------+-------+-------+-------+-------+

The output dataframe should be:

+-------+-----------------+
| name  |marks            |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+
Shubham Agrawal asked Jun 19 '17 09:06



1 Answer

Columns can be merged with Spark's array function:

import pyspark.sql.functions as f

columns = [f.col("mark1"), ...] 

output = input.withColumn("marks", f.array(columns)).select("name", "marks")

You might need to cast the entries to a common type for the merge to succeed, since an array column holds elements of a single type (here the marks are numeric while Grade is a string).

Michael Panchenko answered Sep 29 '22 11:09