How to remove elements from an array Column in Spark?

I have a Seq and a DataFrame. The DataFrame contains a column of array type, and I want to remove from that column any elements that appear in the Seq.

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

+------------------------------------------------+
|sorted_items                                    |
+------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm]    |
|[3, 5, kitchenaid, s]                           |
|[5, 6, case, flip, inch, iphone, on, xs]        |
|[almonds, chocolate, covered, dark, joe, s, the]|
|null                                            |
|[]                                              |
|[animation, book]                               |
+------------------------------------------------+

Expected output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+

How can this be done in an efficient and optimized way?

asked May 17 '19 by user3407267


2 Answers

Use array_except (available since Spark 2.4) from org.apache.spark.sql.functions:

import org.apache.spark.sql.{functions => F}

val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// array_except keeps only the elements of the first array that do not appear in the second
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))

newDF.show(false)

Output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+
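
Since array_except only exists in Spark 2.4 and later, a plain Scala UDF is one possible fallback on older versions. This is just a sketch under that assumption (removeStopWords is an illustrative name, the column is assumed to be array<string>, and null rows are passed through unchanged):

import org.apache.spark.sql.functions.udf

val stopWordSet = stopWords.toSet

// Filter each array against the stop-word set; null arrays are returned unchanged
val removeStopWords = udf((items: Seq[String]) =>
  if (items == null) null else items.filterNot(stopWordSet.contains))

val filteredDF = df.withColumn("sorted_items", removeStopWords(df("sorted_items")))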
answered Oct 21 '22 by gmds


Use StopWordsRemover from the MLlib package. Custom stop words can be set with the setStopWords function. StopWordsRemover does not handle null values, so those need to be dealt with before use. It can be done as follows:

import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{array, coalesce}
import spark.implicits._ // for the $-column syntax (assumes a SparkSession named spark)

// StopWordsRemover cannot handle nulls, so replace null arrays with empty arrays first
val df2 = df.withColumn("sorted_items", coalesce($"sorted_items", array()))

val remover = new StopWordsRemover()
  .setStopWords(stop_words.toArray)
  .setInputCol("sorted_items")
  .setOutputCol("filtered")

val df3 = remover.transform(df2)
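
As a small usage sketch (assuming you want the result back under the original column name; result is just an illustrative name), you can drop the padded input column and rename the output:

// Keep only the filtered column, renamed back to the original name
val result = df3.drop("sorted_items").withColumnRenamed("filtered", "sorted_items")
result.show(false)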
answered Oct 21 '22 by Shaido