I have a PySpark DataFrame and I would like to join 3 columns.
id | column_1 | column_2 | column_3
-----------------------------------
1  | 12       | 34       | 67
2  | 45       | 78       | 90
3  | 23       | 93       | 56
I want to join the 3 columns column_1, column_2, and column_3 into a single column, with "-" between their values.
Expected result:
id | column_1 | column_2 | column_3 | column_join
--------------------------------------------------
1  | 12       | 34       | 67       | 12-34-67
2  | 45       | 78       | 90       | 45-78-90
3  | 23       | 93       | 56       | 23-93-56
How can I do it in PySpark? Thank you.
Concatenating columns in PySpark, whether two or many, is accomplished using the concat() function.
It's pretty simple:
from pyspark.sql.functions import col, concat, lit
df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))
Use concat to concatenate all the columns with the - separator, for which you will need to use lit.
If it doesn't work directly, you can use cast to change the column types to string, e.g. col("column_1").cast("string").
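For example, if the columns are not already strings, casting each one inside concat should work (a minimal sketch of the same approach):
from pyspark.sql.functions import col, concat, lit
# Cast each numeric column to string, then join them with "-" literals.
df = df.withColumn(
    "column_join",
    concat(
        col("column_1").cast("string"), lit("-"),
        col("column_2").cast("string"), lit("-"),
        col("column_3").cast("string"),
    ),
)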
UPDATE:
Or you can use a more dynamic approach using the built-in function concat_ws:
pyspark.sql.functions.concat_ws(sep, *cols)
Concatenates multiple input string columns together into a single string column, using the given separator.
>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]
Code:
from pyspark.sql.functions import col, concat_ws
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
Here is a generic/dynamic way of doing this, instead of concatenating the columns manually. All we need to do is specify the columns that we want to concatenate.
# Importing requisite functions.
from pyspark.sql.functions import col, udf
# Creating the DataFrame
df = spark.createDataFrame([(1,12,34,67),(2,45,78,90),(3,23,93,56)],['id','column_1','column_2','column_3'])
Now, specify the list of columns we want to concatenate, separated by -.
list_of_columns_to_join = ['column_1','column_2','column_3']
Finally, create a UDF. Mind that UDF-based solutions are generally slower than the built-in functions.
def concat_cols(*list_cols):
    # Cast every value to str and join them with "-".
    return '-'.join(str(i) for i in list_cols)

concat_cols = udf(concat_cols)
df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))
df.show()
+---+--------+--------+--------+-----------+
| id|column_1|column_2|column_3|column_join|
+---+--------+--------+--------+-----------+
| 1| 12| 34| 67| 12-34-67|
| 2| 45| 78| 90| 45-78-90|
| 3| 23| 93| 56| 23-93-56|
+---+--------+--------+--------+-----------+
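For comparison, the built-in concat_ws shown earlier gives the same column_join on this DataFrame without the UDF overhead:
from pyspark.sql.functions import concat_ws
df.withColumn('column_join', concat_ws('-', *list_of_columns_to_join)).show()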