
How to remove duplicate values from an RDD [PySpark]

I have the following table as an RDD:

Key Value
1    y
1    y
1    y
1    n
1    n
2    y
2    n
2    n

I want to remove all the duplicates from Value.

Output should come like this:

Key Value
1    y
1    n
2    y
2    n

While working in PySpark, the output should come as a list of key-value pairs like this:

[(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]

I don't know how to apply a for loop here. In a normal Python program it would be very easy.

I wonder if there is a function in PySpark for this.

asked Sep 18 '14 by Prince Bhatti




3 Answers

I'm afraid I have no knowledge of Python, so all the references and code in this answer are in Java. However, it should not be very difficult to translate it into Python.

You should take a look at Spark's official programming guide, which provides a list of all the transformations and actions supported by Spark.

If I am not mistaken, the best approach in your case would be the distinct() transformation, which "returns a new dataset that contains the distinct elements of the source dataset" (quoting the documentation). In Java, it would be something like:

JavaPairRDD<Integer,String> myDataSet = //already obtained somewhere else
JavaPairRDD<Integer,String> distinctSet = myDataSet.distinct();
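
A rough PySpark translation (a sketch, assuming an existing SparkContext named sc; the sample data is illustrative):

# Hypothetical PySpark equivalent of the Java snippet above
myDataSet = sc.parallelize([(1, 'y'), (1, 'y'), (1, 'n'), (2, 'n')])
distinctSet = myDataSet.distinct()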

So, for example, an input partitioned like this:

Partition 1:

1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n

Partition 2:

2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n

Would get converted to something like this (distinct() shuffles the data, so each pair appears exactly once across the whole RDD; which partition it ends up in depends on the hash):

Partition 1:

1-y | 2-y | 2-g

Partition 2:

1-n | 2-n

Note that distinct() requires a shuffle, because duplicates of the same pair may initially live in different partitions.
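
A quick way to see this per-partition behaviour from PySpark is glom(), which turns each partition into a list (a sketch, assuming an existing SparkContext named sc):

pairs = sc.parallelize([(1, 'y'), (1, 'y'), (2, 'n'), (2, 'n')], 2)
# Before distinct(): duplicates can sit in several partitions.
print(pairs.glom().collect())
# After distinct(): the shuffle leaves exactly one copy of each pair.
print(pairs.distinct().glom().collect())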

answered Oct 17 '22 by Mikel Urkia


This problem is simple to solve using the distinct() transformation from PySpark.

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    sc = SparkContext(appName="distinctTuples", conf=SparkConf().set("spark.driver.host", "localhost"))

    # Define the dataset
    dataset = [(u'1',u'y'),(u'1',u'y'),(u'1',u'y'),(u'1',u'n'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n'),(u'2',u'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Remove the duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)

    sc.stop()

This will print the distinct pairs (the order may vary):

Output: [(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
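
If you save this as a script, you can run it locally with spark-submit (the filename here is illustrative):

spark-submit distinctTuples.py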
answered Oct 17 '22 by jsears


If you want to remove all duplicates based on a particular column or set of columns, i.e. do a distinct on a set of columns, then PySpark has the DataFrame function dropDuplicates, which accepts a specific set of columns to distinct on. For example:

df.dropDuplicates(['value']).show()

Note that deduplicating on 'value' alone keeps only one row per distinct value; to reproduce the expected output in the question, deduplicate on both columns with df.dropDuplicates(['key', 'value']) or simply df.distinct().
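
A fuller sketch for the question's data (assuming a SparkSession named spark; the column names key and value are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()

# Build a DataFrame from the question's sample pairs.
df = spark.createDataFrame(
    [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n')],
    ['key', 'value'])

# Deduplicate on both columns to match the expected output.
df.dropDuplicates(['key', 'value']).show()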
answered Oct 17 '22 by captClueless