 

Remove duplicates from a dataframe in PySpark

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

Not quite sure why, as I seem to be following the syntax in the latest documentation.

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()
asked Jun 26 '15 by Jared


People also ask

How do I remove duplicates in spark DataFrame?

The Spark DataFrame API comes with two functions that can be used to remove duplicates from a given DataFrame. These are distinct() and dropDuplicates().
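A minimal sketch of both, using a small hypothetical DataFrame (the column names and data are made up for illustration):

# hypothetical DataFrame with one fully duplicated row
df = sqlContext.createDataFrame(
    [(1, 'a'), (1, 'a'), (2, 'b')], ['id', 'value'])

df.distinct().show()              # removes fully duplicated rows
df.dropDuplicates().show()        # same result when no subset is given
df.dropDuplicates(['id']).show()  # considers only the 'id' column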

How do you delete duplicate columns in PySpark?

To remove duplicate columns after a join in PySpark, specify the join column by name in the join function itself. Here we simply join the two dataframes and the duplicate column is dropped for us, as sketched below.
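A hedged sketch, assuming two dataframes df1 and df2 that share a join key 'id' (the names are placeholders):

# passing the column name (not an expression) keeps a single 'id' column
joined = df1.join(df2, 'id')

# joining on an expression instead would keep both copies:
# df1.join(df2, df1.id == df2.id) yields two 'id' columns

joined.show()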

How do you get distinct records in PySpark?

In PySpark, there are two ways to get the count of distinct values. We can chain the DataFrame's distinct() and count() methods. Another way is to use the SQL countDistinct() function, which returns the distinct value count of all the selected columns.
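A minimal sketch of both approaches, assuming a DataFrame df with a column 'colName':

from pyspark.sql.functions import countDistinct

df.select('colName').distinct().count()   # distinct() followed by count()
df.agg(countDistinct('colName')).show()   # SQL countDistinct() on the selected column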


2 Answers

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, once you apply .collect() you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())

df1.collect()
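Note that df1 is still a DataFrame at this point, so .show() works too; it's the .collect() call that turns it into a plain list, which is why it must come last.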
answered Sep 21 '22 by zero323


If you have a dataframe and want to remove all duplicates, with reference to duplicates in a specific column (called 'colName'):

Count before dedupe:

df.count()

Do the de-dupe (converting the column you are de-duping on to string type first):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupBy to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
answered Sep 21 '22 by Grant Shannon