I have the following table as an RDD:
Key Value
1 y
1 y
1 y
1 n
1 n
2 y
2 n
2 n
I want to remove all the duplicates from Value.
The output should look like this:
Key Value
1 y
1 n
2 y
2 n
While working in pyspark, the output should come as a list of key-value pairs like this:
[(u'1',u'n'),(u'2',u'n')]
I don't know how to apply a for loop here. In a normal Python program it would have been very easy. I wonder if there is some function in pyspark for the same.
I am afraid I have no knowledge of Python, so all the references and code I provide in this answer are in Java. However, it should not be very difficult to translate it into Python code (a rough PySpark sketch follows at the end of this answer).
You should take a look at the following webpage. It links to Spark's official documentation, which provides a list of all the transformations and actions supported by Spark.
If I am not mistaken, the best approach (in your case) would be to use the distinct() transformation, which returns a new dataset that contains the distinct elements of the source dataset (taken from the link). In Java, it would be something like:
JavaPairRDD<Integer,String> myDataSet = //already obtained somewhere else
// distinct() returns a new JavaPairRDD with duplicate (key, value) pairs removed
JavaPairRDD<Integer,String> distinctSet = myDataSet.distinct();
So that, for example:
Partition 1:
1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n
Partition 2:
2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n
Would get converted to:
Partition 1:
1-y | 2-y
1-n | 2-n
Partition 2:
1-y | 2-g | 2-y
1-n | 2-n
Of course, you would still have multiple partitions, each with its own list of distinct elements.
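Since the question is about PySpark, a rough translation of the Java snippet above might look like this (a sketch only; the app name, sample data, and variable names are my own, not part of the original answer):

from pyspark import SparkContext

# A minimal sketch of the same idea in PySpark; the names here are illustrative.
sc = SparkContext(appName="distinctExample")

# Sample pairs mirroring the table in the question.
my_dataset = sc.parallelize([(u'1', u'y'), (u'1', u'y'), (u'1', u'y'),
                             (u'1', u'n'), (u'1', u'n'),
                             (u'2', u'y'), (u'2', u'n'), (u'2', u'n')])

# distinct() returns a new RDD containing only the unique (key, value) pairs.
distinct_set = my_dataset.distinct()
print(distinct_set.collect())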
This problem is simple to solve using the distinct operation of the pyspark library from Apache Spark.
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    sc = SparkContext(appName="distinctTuples", conf=SparkConf().set("spark.driver.host", "localhost"))

    # Define the dataset
    dataset = [(u'1',u'y'),(u'1',u'y'),(u'1',u'y'),(u'1',u'n'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n'),(u'2',u'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Filter out duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)
This will output the following:
Output: [(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
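Note that distinct() involves a shuffle, so the order of the collected tuples is not guaranteed to match the listing above. If a deterministic order matters, one option (a small addition of mine, continuing the example above, not part of the original answer) is to sort before collecting:

    # Sort the deduplicated pairs before collecting, for a stable order.
    orderedTuples = distinctTuplesRdd.sortBy(lambda kv: kv).collect()
    print('Ordered: %s' % orderedTuples)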
If you want to remove all duplicates from a particular column or set of columns, i.e. doing a distinct on a set of columns, then pyspark has the function dropDuplicates, which will accept a specific set of columns to distinct on.
For example:
df.dropDuplicates(['value']).show()
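Since the question starts from an RDD of pairs rather than a DataFrame, a rough end-to-end sketch might look like the following (the SparkSession setup and the column names "key" and "value" are my own assumptions, not part of the answer):

from pyspark.sql import SparkSession

# A rough sketch; the app name and column names are assumptions for illustration.
spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()

pairs = spark.sparkContext.parallelize(
    [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n'), (u'2', u'n')])

# Convert the RDD of pairs into a DataFrame with named columns.
df = pairs.toDF(["key", "value"])

# Distinct on both columns (equivalent to df.distinct() here).
df.dropDuplicates(["key", "value"]).show()

# Distinct on the "value" column only, as in the one-liner above.
df.dropDuplicates(["value"]).show()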