I need to process a dataset to identify frequent itemsets, so the input column must be an array. The original column is a string with the items separated by commas, so I did the following:
functions.split(out_1['skills'], ',')
The problem is that, for some rows, there are duplicated values in the skills column, and this causes an error when trying to identify the frequent itemsets.
I wanted to convert the vector to a set to remove the duplicated elements. Something like this:
functions.to_set(functions.split(out_1['skills'], ','))
But I could not find a function to convert a column from vector to set, i.e., there is no to_set function.
How can I accomplish what I want, i.e., remove the duplicated elements from the vector?
When possible, it is recommended to use native Spark functions instead of UDFs, for efficiency reasons. There is a dedicated function that keeps only the unique items in an array column: array_distinct(), introduced in Spark 2.4.0.
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(skills='a,a,b,c'),
    Row(skills='a,b,c'),
    Row(skills='c,d,e,e'),
])

# split the comma-separated string into an array, then drop duplicate items
df = df.withColumn('skills_arr', F.array_distinct(F.split(df.skills, ',')))
df.show(truncate=False)
Result:
+-------+----------+
|skills |skills_arr|
+-------+----------+
|a,a,b,c|[a, b, c] |
|a,b,c |[a, b, c] |
|c,d,e,e|[c, d, e] |
+-------+----------+
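Since the end goal was frequent itemset mining, the deduplicated column can be passed straight to Spark ML's FPGrowth, which is the component that rejects transactions containing duplicate items. A minimal sketch, reusing the DataFrame from above; the minSupport and minConfidence values are just illustrative:

from pyspark.ml.fpm import FPGrowth

# FPGrowth requires each row's item array to contain no duplicates,
# which is exactly what array_distinct guarantees above
fp = FPGrowth(itemsCol='skills_arr', minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)
model.freqItemsets.show()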