pyspark

Question

I have a Spark dataframe with 2 columns, each produced by the function collect_set. I would like to combine these 2 columns of sets into 1 column containing a single set. How should I do this? Both are sets of strings.

For instance, I have 2 columns formed by calling collect_set:

Fruits                | Meat
[Apple, Orange, Pear] | [Beef, Chicken, Pork]

How do I turn it into:

Food
[Apple, Orange, Pear, Beef, Chicken, Pork]

Thank you very much in advance for your help.

Czechnology · Accepted Answer

I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:

df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1,False)

from pyspark.sql.functions import col, udf
# Spark passes each array column to the lambda as a Python list, so + concatenates them
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Kudos to Ramesh!

EDIT: Note that you might have to specify the return type manually. When none is given, udf defaults to StringType, so the new column can end up as a string instead of an array of strings.

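from pyspark.sql.types import ArrayType, StringType
# Declare the return type so that Food is an array of strings rather than a string
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))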
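Since both columns come from collect_set, the same value could in principle appear in both, and plain list concatenation would then keep it twice. Below is a minimal sketch of a merge that drops duplicates and treats a null column as empty. The names `merge_unique` and `make_merge_udf` are hypothetical, made up for this illustration, and the null handling is an assumption about what you want rather than something from the question:

```python
def merge_unique(fruits, meat):
    """Concatenate two lists, keeping only the first occurrence of each value."""
    seen = set()
    merged = []
    # Assumption: a null (None) column value is treated as an empty list.
    for item in list(fruits or []) + list(meat or []):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


def make_merge_udf():
    """Wrap merge_unique as a Spark UDF with an explicit array<string> return type."""
    # Imported lazily so merge_unique can be tried out without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    return udf(merge_unique, ArrayType(StringType()))
```

It would be used the same way as mergeCols, for example `df.withColumn("Food", make_merge_udf()(col("Fruits"), col("Meat")))`. Keeping the merge logic in a plain function makes it easy to check in a Python shell before running it on a cluster.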

Ramesh Maharjan · Answer

Given that you have a dataframe such as

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+

You can write a udf function to merge the sets of two columns into one.

import org.apache.spark.sql.functions._
import scala.collection.mutable
// Append the Meat array to the Fruits array
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)

And then call the udf function as

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)

You should have your desired final dataframe

+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
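If you are on Spark 2.4 or later, the built-in array functions should be able to do the same thing without a udf: `array_union` merges two array columns and drops duplicates, while `concat` (which accepts array columns from 2.4 on) keeps them. A rough sketch in Python, assuming the column names from the question; `add_food_column` and its `dedupe` flag are hypothetical names used only for this example:

```python
def add_food_column(df, dedupe=True):
    """Return df with a Food column built from Fruits and Meat without a UDF.

    Assumes Spark 2.4 or later, where array_union exists and concat accepts arrays.
    """
    # Imported lazily so this function can be defined without Spark installed.
    from pyspark.sql.functions import array_union, col, concat

    if dedupe:
        # Set-style union: values present in both columns appear once.
        food = array_union(col("Fruits"), col("Meat"))
    else:
        # Plain concatenation: duplicates are kept.
        food = concat(col("Fruits"), col("Meat"))
    return df.withColumn("Food", food)
```

Calling `add_food_column(df).show(1, False)` on the example dataframe would add the Food column. Built-in functions run inside Spark itself, so they avoid the serialization cost of shipping rows to a Python udf.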

pyspark - merge 2 columns of sets

Tags:

apache-spark

pyspark-sql

soulless
