How to remove elements from an array Column in Spark?

I have a Seq and a DataFrame. The DataFrame contains a column of array type, and I want to remove from that column any elements that appear in the Seq.

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

+------------------------------------------------+
|sorted_items                                    |
+------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm]    |
|[3, 5, kitchenaid, s]                           |
|[5, 6, case, flip, inch, iphone, on, xs]        |
|[almonds, chocolate, covered, dark, joe, s, the]|
|null                                            |
|[]                                              |
|[animation, book]                               |
+------------------------------------------------+

Expected output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+

How can this be done in an efficient and optimized way?

asked May 17 '19 by user3407267


2 Answers

Use array_except (available since Spark 2.4) from org.apache.spark.sql.functions:

import org.apache.spark.sql.{functions => F}

val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// array_except keeps only the elements of the first array that do not appear in the second
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))

newDF.show(false)

Output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+
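
Since array_except only exists in Spark 2.4 and later, a plain Scala UDF is one possible fallback on older versions. This is just a sketch under that assumption (removeStopWords is an illustrative name, the column is assumed to be array<string>, and null rows are passed through unchanged):

import org.apache.spark.sql.functions.udf

val stopWordSet = stopWords.toSet

// Filter each array against the stop-word set; null arrays are returned unchanged
val removeStopWords = udf((items: Seq[String]) =>
  if (items == null) null else items.filterNot(stopWordSet.contains))

val filteredDF = df.withColumn("sorted_items", removeStopWords(df("sorted_items")))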
answered Oct 21 '22 by gmds


Use StopWordsRemover from the MLlib package. Custom stop words can be set with the setStopWords function. StopWordsRemover does not handle null values, so those need to be dealt with before use. It can be done as follows:

import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{array, coalesce}
import spark.implicits._ // for the $-column syntax (assumes a SparkSession named spark)

// StopWordsRemover cannot handle nulls, so replace null arrays with empty arrays first
val df2 = df.withColumn("sorted_items", coalesce($"sorted_items", array()))

val remover = new StopWordsRemover()
  .setStopWords(stop_words.toArray)
  .setInputCol("sorted_items")
  .setOutputCol("filtered")

val df3 = remover.transform(df2)
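
As a small usage sketch (assuming you want the result back under the original column name; result is just an illustrative name), you can drop the padded input column and rename the output:

// Keep only the filtered column, renamed back to the original name
val result = df3.drop("sorted_items").withColumnRenamed("filtered", "sorted_items")
result.show(false)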
answered Oct 21 '22 by Shaido