
How to remove duplicate values from an RDD [PySpark]

I have the following table as an RDD:

Key Value
1    y
1    y
1    y
1    n
1    n
2    y
2    n
2    n

I want to remove all the duplicates from Value.

Output should come like this:

Key Value
1    y
1    n
2    y
2    n

While working in PySpark, the output should come as a list of key-value pairs like this:

[(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]

I don't know how to apply a for loop here. In a normal Python program it would be very easy.

I wonder if there is a function in PySpark for this.

asked Sep 18 '14 by Prince Bhatti




3 Answers

I'm afraid I have no knowledge of Python, so all the references and code in this answer are in Java. However, it should not be very difficult to translate it into Python.

You should take a look at Spark's official programming guide, which provides a list of all the transformations and actions supported by Spark.

If I am not mistaken, the best approach in your case would be the distinct() transformation, which "returns a new dataset that contains the distinct elements of the source dataset" (quoting the documentation). In Java, it would be something like:

JavaPairRDD<Integer,String> myDataSet = //already obtained somewhere else
JavaPairRDD<Integer,String> distinctSet = myDataSet.distinct();
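
A rough PySpark translation (a sketch, assuming an existing SparkContext named sc; the sample data is illustrative):

# Hypothetical PySpark equivalent of the Java snippet above
myDataSet = sc.parallelize([(1, 'y'), (1, 'y'), (1, 'n'), (2, 'n')])
distinctSet = myDataSet.distinct()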

So, for example, an input partitioned like this:

Partition 1:

1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n

Partition 2:

2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n

Would get converted to something like this (distinct() shuffles the data, so each pair appears exactly once across the whole RDD; which partition it ends up in depends on the hash):

Partition 1:

1-y | 2-y | 2-g

Partition 2:

1-n | 2-n

Note that distinct() requires a shuffle, because duplicates of the same pair may initially live in different partitions.
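
A quick way to see this per-partition behaviour from PySpark is glom(), which turns each partition into a list (a sketch, assuming an existing SparkContext named sc):

pairs = sc.parallelize([(1, 'y'), (1, 'y'), (2, 'n'), (2, 'n')], 2)
# Before distinct(): duplicates can sit in several partitions.
print(pairs.glom().collect())
# After distinct(): the shuffle leaves exactly one copy of each pair.
print(pairs.distinct().glom().collect())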

answered Oct 17 '22 by Mikel Urkia


This problem is simple to solve using the distinct() transformation from PySpark.

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    sc = SparkContext(appName="distinctTuples", conf=SparkConf().set("spark.driver.host", "localhost"))

    # Define the dataset
    dataset = [(u'1',u'y'),(u'1',u'y'),(u'1',u'y'),(u'1',u'n'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n'),(u'2',u'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Remove the duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)

    sc.stop()

This will print the distinct pairs (the order may vary):

Output: [(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
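
If you save this as a script, you can run it locally with spark-submit (the filename here is illustrative):

spark-submit distinctTuples.py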
answered Oct 17 '22 by jsears


If you want to remove all duplicates based on a particular column or set of columns, i.e. do a distinct on a set of columns, then PySpark has the DataFrame function dropDuplicates, which accepts a specific set of columns to distinct on. For example:

df.dropDuplicates(['value']).show()

Note that deduplicating on 'value' alone keeps only one row per distinct value; to reproduce the expected output in the question, deduplicate on both columns with df.dropDuplicates(['key', 'value']) or simply df.distinct().
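
A fuller sketch for the question's data (assuming a SparkSession named spark; the column names key and value are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()

# Build a DataFrame from the question's sample pairs.
df = spark.createDataFrame(
    [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n')],
    ['key', 'value'])

# Deduplicate on both columns to match the expected output.
df.dropDuplicates(['key', 'value']).show()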
answered Oct 17 '22 by captClueless