
Remove duplicates within Spark array column


I have a given Dataset:

+-------------------+--------------------+
|               date|            products|
+-------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|
|2017-09-21 22:00:00|[361, 361, 361, 3...|
|2017-09-28 22:00:00|[360, 361, 361, 3...|

where the products column is an array of strings with possibly duplicated items.

I would like to remove this duplication (within one row).

What I did was basically write a UDF like this:

    import scala.collection.mutable.WrappedArray
    import org.apache.spark.sql.functions.udf

    val removeDuplicates: WrappedArray[String] => WrappedArray[String] = _.distinct
    val udfremoveDuplicates = udf(removeDuplicates)
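
I then apply it like this (df being the Dataset shown above):

    import org.apache.spark.sql.functions.col

    // add a column with per-row duplicates removed
    val result = df.withColumn("rm_duplicates", udfremoveDuplicates(col("products")))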

This solution gives me the proper results:

+-------------------+--------------------+--------------------+
|               date|            products|       rm_duplicates|
+-------------------+--------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|[361, 362, 363, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|[361, 362, 363, 3...|

My questions are:

  1. Does Spark provide a better/more efficient way of getting this result?

  2. I was thinking about using a map, but how do I get the desired column as a List so that I can use the distinct method, as in my removeDuplicates lambda?

Edit: I marked this topic with the java tag because it does not matter to me whether I get an answer in Scala or Java :) Edit2: typos

asked Nov 12 '17 by zbyszekt



1 Answer

The approach presented in the question, using a UDF, is the best approach, as Spark SQL has no built-in primitive to uniquify arrays. (Note for later readers: Spark 2.4 eventually added a built-in array_distinct function; see the sketch below.)
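
If you are on Spark 2.4 or later, a minimal sketch of the built-in route (df stands for the DataFrame from the question):

    import org.apache.spark.sql.functions.{array_distinct, col}

    // Spark 2.4+: built-in array deduplication, no UDF required
    val deduped = df.withColumn("rm_duplicates", array_distinct(col("products")))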

If you are dealing with massive amounts of data, and/or the array values have special properties, then it's worth thinking about the implementation of the UDF.

WrappedArray.distinct builds a mutable.HashSet behind the scenes and then traverses it to build the array of distinct elements. There are two possible problems with this from a performance standpoint:

  1. Scala's mutable collections are not wonderfully efficient, which is why in the guts of Spark you'll find a lot of Java collections and while loops. If you are in need of extreme performance, you can implement your own generic distinct using faster data structures.

  2. A generic implementation of distinct does not take advantage of any properties of your data. For example, if the arrays are small on average, then a simple implementation that builds directly into an array and does a linear search for duplicates may perform much better than code that builds a complex data structure, despite its theoretical O(n^2) complexity; see the sketch after this list. For another example, if the values can only be numbers in a small range, or strings from a small set, you can implement uniquification via a bit set.
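
Here is a hypothetical sketch of that linear-search idea (the names smallDistinct and udfSmallDistinct are invented for illustration, not from any library):

    import scala.collection.mutable.{ArrayBuffer, WrappedArray}
    import org.apache.spark.sql.functions.udf

    // Linear-search distinct: quadratic in the worst case, but for short
    // arrays it avoids hashing and allocates only the output buffer.
    val smallDistinct: WrappedArray[String] => Seq[String] = { arr =>
      val out = new ArrayBuffer[String](arr.length)
      var i = 0
      while (i < arr.length) {
        val v = arr(i)
        if (!out.contains(v)) out += v  // scan only the elements kept so far
        i += 1
      }
      out
    }
    val udfSmallDistinct = udf(smallDistinct)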

Again, these strategies should only be considered if you have ridiculous amounts of data. Your simple implementation is perfectly suitable for almost every situation.

answered Sep 22 '22 by Sim