 

Remove duplicates from a dataframe in PySpark

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

Not quite sure why, as I seem to be following the syntax in the latest documentation.

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()
asked Jun 26 '15 by Jared


People also ask

How do I remove duplicates in spark DataFrame?

The Spark DataFrame API comes with two functions that can be used to remove duplicates from a given DataFrame. These are distinct() and dropDuplicates().
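A minimal sketch of both, using a small hypothetical DataFrame (the column names and data are made up for illustration):

# hypothetical DataFrame with one fully duplicated row
df = sqlContext.createDataFrame(
    [(1, 'a'), (1, 'a'), (2, 'b')], ['id', 'value'])

df.distinct().show()              # removes fully duplicated rows
df.dropDuplicates().show()        # same result when no subset is given
df.dropDuplicates(['id']).show()  # considers only the 'id' column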

How do you delete duplicate columns in PySpark?

To remove duplicate columns after a join in PySpark, specify the join column by name in the join function itself. Here we simply join the two dataframes and the duplicate column is dropped for us, as sketched below.
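A hedged sketch, assuming two dataframes df1 and df2 that share a join key 'id' (the names are placeholders):

# passing the column name (not an expression) keeps a single 'id' column
joined = df1.join(df2, 'id')

# joining on an expression instead would keep both copies:
# df1.join(df2, df1.id == df2.id) yields two 'id' columns

joined.show()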

How do you get distinct records in PySpark?

In PySpark, there are two ways to get the count of distinct values. We can chain the DataFrame's distinct() and count() methods. Another way is to use the SQL countDistinct() function, which returns the distinct value count of all the selected columns.
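A minimal sketch of both approaches, assuming a DataFrame df with a column 'colName':

from pyspark.sql.functions import countDistinct

df.select('colName').distinct().count()   # distinct() followed by count()
df.agg(countDistinct('colName')).show()   # SQL countDistinct() on the selected column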


2 Answers

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, once you apply .collect() you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())

df1.collect()
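Note that df1 is still a DataFrame at this point, so .show() works too; it's the .collect() call that turns it into a plain list, which is why it must come last.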
answered Sep 21 '22 by zero323


If you have a dataframe and want to remove all duplicates, with reference to duplicates in a specific column (called 'colName'):

Count before dedupe:

df.count()

Do the de-dupe (converting the column you are de-duping on to string type first):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupBy to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
answered Sep 21 '22 by Grant Shannon