Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',2),
                       ('Baz',22,'US',6),
                       ('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
which removes rows that are identical across every column.
But how do I remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying the columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
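For reference, a minimal pandas version of what I mean might look like this (assuming the tuples are loaded into a pandas DataFrame with the default integer column labels):

import pandas as pd

rows = [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2), ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6), ('Baz', 36, 'US', 6)]
pdf = pd.DataFrame(rows)  # columns get the default labels 0, 1, 2, 3

# Keep the first row for each combination of columns 0, 2 and 3
print(pdf.drop_duplicates(subset=[0, 2, 3]))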
Use dropDuplicates() to remove duplicate rows from a DataFrame. Spark's distinct() does not accept a set of columns to run the distinct on; however, dropDuplicates() has a signature that takes a list of columns and eliminates duplicates based on those columns only. In other words, distinct() drops rows that are duplicated across all columns, while dropDuplicates() drops rows based on one or more selected columns.
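A minimal sketch of how this might look for the data in the question (assuming a SparkSession is available; tuples converted with toDF() get the default column names _1 to _4):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = sc.parallelize([('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2), ('Bar', 72, 'CA', 2),
                       ('Baz', 22, 'US', 6), ('Baz', 36, 'US', 6)])

df = data.toDF()  # columns are named _1, _2, _3, _4 by default

# Keep one row per combination of the 1st, 3rd and 4th columns;
# which of the matching rows survives is not guaranteed.
df.dropDuplicates(['_1', '_3', '_4']).show()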
PySpark does include a dropDuplicates() method, which was introduced in 1.4: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    return "{0}{1}{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now, you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey or a groupByKey and a filter. This would eliminate duplicates:
r = m.reduceByKey(lambda x, y: x)
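A sketch of how the deduplicated rows might be pulled back out of r (the variable name deduped is just illustrative; which of the tied rows survives, and the order collect() returns them in, are not guaranteed):

# Drop the synthetic key again and keep only the original tuples
deduped = r.map(lambda kv: kv[1])
print(deduped.collect())
# Roughly: [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1),
#           ('Bar', 57, 'CA', 2), ('Baz', 22, 'US', 6)]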