I have a PySpark DataFrame and I would like to join 3 columns.
id | column_1 | column_2 | column_3
-----------------------------------
1  | 12       | 34       | 67
2  | 45       | 78       | 90
3  | 23       | 93       | 56
I want to join the 3 columns column_1, column_2, and column_3 into a single column, with "-" between their values.
Expected result:
id | column_1 | column_2 | column_3 | column_join
--------------------------------------------------
1  | 12       | 34       | 67       | 12-34-67
2  | 45       | 78       | 90       | 45-78-90
3  | 23       | 93       | 56       | 23-93-56
How can I do it in PySpark? Thank you.
Concatenating columns in PySpark, whether two or many, is accomplished using the concat() function.
It's pretty simple:
from pyspark.sql.functions import col, concat, lit
df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))
Use concat to concatenate all the columns with the - separator, for which you will need to use lit.
If it doesn't work directly, you can use cast to change the column types to string, e.g. col("column_1").cast("string").
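For example, if the columns are not already strings, casting each one inside concat should work (a minimal sketch of the same approach):
from pyspark.sql.functions import col, concat, lit
# Cast each numeric column to string, then join them with "-" literals.
df = df.withColumn(
    "column_join",
    concat(
        col("column_1").cast("string"), lit("-"),
        col("column_2").cast("string"), lit("-"),
        col("column_3").cast("string"),
    ),
)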
UPDATE:
Or you can use a more dynamic approach using the built-in function concat_ws:
pyspark.sql.functions.concat_ws(sep, *cols)
Concatenates multiple input string columns together into a single string column, using the given separator.
>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]
Code:
from pyspark.sql.functions import col, concat_ws
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
Here is a generic/dynamic way of doing this, instead of concatenating the columns manually. All we need to do is specify the columns that we want to concatenate.
# Importing requisite functions.
from pyspark.sql.functions import col, udf
# Creating the DataFrame
df = spark.createDataFrame([(1,12,34,67),(2,45,78,90),(3,23,93,56)],['id','column_1','column_2','column_3'])
Now, specify the list of columns we want to concatenate, separated by -.
list_of_columns_to_join = ['column_1','column_2','column_3']
Finally, create a UDF. Mind that UDF-based solutions are generally slower than the built-in functions.
def concat_cols(*list_cols):
    # Cast every value to str and join them with "-".
    return '-'.join(str(i) for i in list_cols)

concat_cols = udf(concat_cols)
df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))
df.show()
+---+--------+--------+--------+-----------+
| id|column_1|column_2|column_3|column_join|
+---+--------+--------+--------+-----------+
| 1| 12| 34| 67| 12-34-67|
| 2| 45| 78| 90| 45-78-90|
| 3| 23| 93| 56| 23-93-56|
+---+--------+--------+--------+-----------+
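For comparison, the built-in concat_ws shown earlier gives the same column_join on this DataFrame without the UDF overhead:
from pyspark.sql.functions import concat_ws
df.withColumn('column_join', concat_ws('-', *list_of_columns_to_join)).show()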