I have a Seq
and dataframe. The dataframe contains a column of array type. I am trying to remove the elements that are in the Seq
from the column.
For example:
val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")
+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm] |
|[3, 5, kitchenaid, s] |
|[5, 6, case, flip, inch, iphone, on, xs] |
|[almonds, chocolate, covered, dark, joe, s, the] |
|null |
|[] |
|[animation, book] |
Expected output:
+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm] |
|[3, 5, kitchenaid] |
|[5, 6, case, flip, inch, iphone, xs] |
|[almonds, chocolate, covered, dark, joe, the] |
|null |
|[] |
|[animation, book] |
How can this be done in an effective and optimized way?
In this method, we are first going to make a PySpark DataFrame using createDataFrame(). We will then use randomSplit() function to get two slices of the DataFrame while specifying the fractions of rows that will be present in both slices. The rows are split up RANDOMLY.
Use array_except
from spark.sql.functions
:
import org.apache.spark.sql.{functions => F}
val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))
newDF.show(false)
Output:
+----------------------------------------+
|sorted_items |
+----------------------------------------+
|[flannel, shirts, sleeve, warm] |
|[3, 5, kitchenaid] |
|[5, 6, case, flip, inch, iphone, xs] |
|[almonds, chocolate, covered, dark, joe]|
|null |
|[] |
|[animation, book] |
+----------------------------------------+
Use the StopWordsRemover
from the MLlib package. It is possible to set custom stop words using the setStopWords
function. StopWordsRemover
will not handle null values so those will need to be dealt with before usage. It can be done as follows:
val df2 = df.withColumn("sorted_values", coalesce($"sorted_values", array()))
val remover = new StopWordsRemover()
.setStopWords(stop_words.toArray)
.setInputCol("sorted_values")
.setOutputCol("filtered")
val df3 = remover.transform(df2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With