
How to concatenate multiple columns in PySpark with a separator?

I have a PySpark DataFrame and would like to join three of its columns.

id |  column_1   | column_2    | column_3
1  |     12      |   34        |    67
2  |     45      |   78        |    90
3  |     23      |   93        |    56

I want to join the three columns column_1, column_2, and column_3 into a single column, with "-" between their values.

Expected result:

id |  column_1   | column_2    | column_3    |   column_join
1  |     12      |     34      |     67      |   12-34-67
2  |     45      |     78      |     90      |   45-78-90
3  |     23      |     93      |     56      |   23-93-56

How can I do this in PySpark? Thank you.

verojoucla asked Nov 25 '19 13:11


2 Answers

It's pretty simple:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))

Use concat to concatenate the columns. The "-" separator has to be wrapped in lit so that it is treated as a literal value rather than a column name.

If that doesn't work directly, cast the columns to string first, for example col("column_1").cast("string").


Or you can use a more dynamic approach with the built-in function concat_ws:

pyspark.sql.functions.concat_ws(sep, *cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s='abcd-123')]


from pyspark.sql.functions import col, concat_ws

# Columns to concatenate, in order.
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
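One practical difference between concat and concat_ws is how they treat nulls: concat returns null as soon as any input is null, while concat_ws skips null inputs and joins the rest. The snippet below is only a plain-Python model of that behavior, not Spark code. The helper names model_concat and model_concat_ws are hypothetical and exist just for this illustration.

```python
from typing import Optional


def model_concat(*values: Optional[str]) -> Optional[str]:
    """Model of concat: the result is None (null) if any input is None."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def model_concat_ws(sep: str, *values: Optional[str]) -> str:
    """Model of concat_ws: None inputs are skipped, the rest are joined with sep."""
    return sep.join(v for v in values if v is not None)


print(model_concat("12", "-", "34", "-", "67"))  # 12-34-67
print(model_concat("12", "-", None, "-", "67"))  # None
print(model_concat_ws("-", "12", "34", "67"))    # 12-34-67
print(model_concat_ws("-", "12", None, "67"))    # 12-67
```

If the columns can contain nulls, this difference is usually the deciding factor between the two functions.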
pissall answered Oct 17 '22 03:10


Here is a generic/dynamic way of doing this, instead of manually concatenating it. All we need is to specify the columns that we need to concatenate.

# Importing requisite functions.
from pyspark.sql.functions import col, udf

# Creating the DataFrame
df = spark.createDataFrame([(1,12,34,67),(2,45,78,90),(3,23,93,56)],['id','column_1','column_2','column_3'])

Now, specifying the list of columns we want to concatenate, separated by -.

list_of_columns_to_join = ['column_1','column_2','column_3']

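Finally, create a UDF. Note that UDF-based solutions are generally slower than built-in functions such as concat_ws, because every row has to be passed between the JVM and the Python interpreter.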

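def concat_cols(*list_cols):
    # Convert every value to a string and join them with "-".
    return '-'.join(str(i) for i in list_cols)

# Wrap the Python function as a UDF (the return type defaults to string).
concat_cols = udf(concat_cols)
df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))
df.show()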
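+---+--------+--------+--------+-----------+
| id|column_1|column_2|column_3|column_join|
+---+--------+--------+--------+-----------+
|  1|      12|      34|      67|   12-34-67|
|  2|      45|      78|      90|   45-78-90|
|  3|      23|      93|      56|   23-93-56|
+---+--------+--------+--------+-----------+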
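One caveat with a str()-based join like the one above: a null arrives in the Python function as None, and str(None) is the text "None", so a missing value would show up literally in the joined string. A minimal sketch of a variant that skips missing values instead is shown below. The name join_non_null and its sep parameter are hypothetical, and the function is plain Python so it can be tried without a Spark session.

```python
def join_non_null(*values, sep="-"):
    """Join the string form of each value with sep, skipping None values."""
    return sep.join(str(v) for v in values if v is not None)


# Behavior of a plain str()-based join when a value is missing.
print("-".join(str(v) for v in (12, None, 67)))  # 12-None-67

# The null-skipping variant.
print(join_non_null(12, 34, 67))    # 12-34-67
print(join_non_null(12, None, 67))  # 12-67
```

A function like this could presumably be wrapped with udf in the same way as concat_cols. Whether skipping nulls is the right choice depends on what the joined column is used for.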
cph_sto answered Oct 17 '22 03:10
