I need to process a dataset to identify frequent itemsets, so the input column must be an array. The original column is a string with the items separated by commas, so I did the following:
functions.split(out_1['skills'], ',')
The problem is that, for some rows, there are duplicated values in the skills column, and this causes an error when trying to identify the frequent itemsets.
I wanted to convert the vector to a set to remove the duplicated elements. Something like this:
functions.to_set(functions.split(out_1['skills'], ','))
But I could not find a function to convert a column from vector to set, i.e., there is no to_set function.
How can I accomplish what I want, i.e., remove the duplicated elements from the vector?
When possible, it is recommended to use native Spark functions instead of UDFs, for efficiency reasons. There is a dedicated function that keeps only the unique items in an array column: array_distinct(), introduced in Spark 2.4.0.
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(skills='a,a,b,c'),
    Row(skills='a,b,c'),
    Row(skills='c,d,e,e'),
])

# split the comma-separated string into an array, then drop duplicate items
df = df.withColumn('skills_arr', F.array_distinct(F.split(df.skills, ',')))
df.show(truncate=False)
Result:
+-------+----------+
|skills |skills_arr|
+-------+----------+
|a,a,b,c|[a, b, c] |
|a,b,c |[a, b, c] |
|c,d,e,e|[c, d, e] |
+-------+----------+
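Since the end goal was frequent itemset mining, the deduplicated column can be passed straight to Spark ML's FPGrowth, which is the component that rejects transactions containing duplicate items. A minimal sketch, reusing the DataFrame from above; the minSupport and minConfidence values are just illustrative:

from pyspark.ml.fpm import FPGrowth

# FPGrowth requires each row's item array to contain no duplicates,
# which is exactly what array_distinct guarantees above
fp = FPGrowth(itemsCol='skills_arr', minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)
model.freqItemsets.show()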