I need to merge multiple columns of a dataframe into one single column, with a list (or tuple) as the value for that column, using PySpark in Python.
Input dataframe:
+------+-------+-------+-------+-------+
| name | mark1 | mark2 | mark3 | Grade |
+------+-------+-------+-------+-------+
| Jim  | 20    | 30    | 40    | "C"   |
+------+-------+-------+-------+-------+
| Bill | 30    | 35    | 45    | "A"   |
+------+-------+-------+-------+-------+
| Kim  | 25    | 36    | 42    | "B"   |
+------+-------+-------+-------+-------+
Output dataframe should be:
+------+----------------+
| name | marks          |
+------+----------------+
| Jim  | [20,30,40,"C"] |
+------+----------------+
| Bill | [30,35,45,"A"] |
+------+----------------+
| Kim  | [25,36,42,"B"] |
+------+----------------+
To combine multiple non-array columns into a single column of arrays in a PySpark DataFrame, use the array() function from the pyspark.sql.functions module.
Concatenate using concat(): the concat() function in pyspark.sql.functions concatenates multiple DataFrame columns into a single column. It works on string, binary, and compatible array columns.
In pandas (not PySpark), DataFrame.apply() can concatenate multiple column values into a single column with slightly less typing, and it scales well when you want to join many columns.
Concatenate with a delimiter using concat_ws(): adding a delimiter while concatenating DataFrame columns can be done with the concat_ws() function. It takes the delimiter as its first argument, followed by the columns to concatenate.
Columns can be merged with Spark's array function:
import pyspark.sql.functions as f

columns = [f.col("mark1"), f.col("mark2"), f.col("mark3"), f.col("Grade")]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might need to cast the entries to a common type (for example string) for the merge to succeed, since a Spark array column can only hold elements of a single type.