 

How to convert a Spark DataFrame column from vector to a set?

I need to process a dataset to identify frequent itemsets, so the input column must be a vector. The original column is a string with the items separated by commas, so I did the following:

functions.split(out_1['skills'], ',')

The problem is that, for some rows, I have duplicated values in the skills, and this is causing an error when trying to identify the frequent itemsets.
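For reference, here is a minimal sketch reproducing the failure, assuming the frequent-itemset step uses pyspark.ml.fpm.FPGrowth (the sample values are made up):

from pyspark.sql import Row, SparkSession
from pyspark.ml.fpm import FPGrowth
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(skills='a,a,b,c'), Row(skills='b,c,d')])
df = df.withColumn('items', F.split(df.skills, ','))

fp = FPGrowth(itemsCol='items', minSupport=0.5, minConfidence=0.5)
# The first row contains 'a' twice, so fitting typically fails with
# "Items in a transaction must be unique".
model = fp.fit(df)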

I wanted to convert the vector to a set to remove the duplicated elements. Something like this:

functions.to_set(functions.split(out_1['skills'], ','))

But I could not find a function to convert a column from vector to set, i.e., there is no to_set function.

How can I accomplish what I want, i.e., remove the duplicated elements from the vector?

asked Nov 01 '25 by Jeff

1 Answer

It is recommended, when possible, to use native Spark functions instead of UDFs for efficiency reasons. There is a dedicated function that keeps only the unique items in an array column: array_distinct(), introduced in Spark 2.4.0.

from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(skills='a,a,b,c'),
    Row(skills='a,b,c'),
    Row(skills='c,d,e,e'),
])

# Split the comma-separated string, then keep only the distinct items
df = df.withColumn('skills_arr', F.array_distinct(F.split(df.skills, ",")))
df.show(truncate=False)

Result:

+-------+----------+
|skills |skills_arr|
+-------+----------+
|a,a,b,c|[a, b, c] |
|a,b,c  |[a, b, c] |
|c,d,e,e|[c, d, e] |
+-------+----------+
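If you are on a Spark version older than 2.4, a UDF is the usual fallback. A minimal sketch (using dict.fromkeys rather than set() so the original item order is preserved):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# Fallback for Spark < 2.4, where array_distinct is not available.
# dict.fromkeys keeps the first occurrence of each item, preserving order.
dedupe = F.udf(lambda xs: list(dict.fromkeys(xs)) if xs is not None else None,
               ArrayType(StringType()))

df = df.withColumn('skills_arr', dedupe(F.split(df.skills, ",")))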
answered Nov 03 '25 by l_po


