
PySpark DataFrame unable to drop duplicates

Hello, I have created a Spark DataFrame and am trying to remove duplicates:

df.drop_duplicates(subset='id')

I get the following error:

Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonUtils.toSeq. Trace:
py4j.Py4JException: Method toSeq([class java.lang.String]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:360)
    at py4j.Gateway.invoke(Gateway.java:254)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

I am using OS X 10.11.4 and Spark 1.6.1.

I launched a Jupyter notebook like this:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark

Are there some other configurations that I might have missed or gotten wrong?

asked May 07 '16 by Max

People also ask

How do I drop duplicates in PySpark keep last?

dropDuplicates(): the PySpark DataFrame API provides a dropDuplicates() function that drops duplicate rows from a DataFrame. The function optionally takes a list of column names; duplicates are then identified by the values in those columns only.
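PySpark's dropDuplicates() does not guarantee *which* of the duplicate rows survives, so "keep last" usually means ordering the DataFrame first (for example by a timestamp, descending) before deduplicating. To illustrate the keep-last idea itself without a running Spark cluster, here is a plain-Python sketch (the row layout and key name are assumptions for the example):

```python
def keep_last(rows, key):
    """Keep only the last occurrence of each key value.

    Later rows overwrite earlier ones in the dict, so the surviving
    row for each key is the last one seen in input order.
    """
    seen = {}
    for row in rows:
        seen[row[key]] = row  # overwrite: last occurrence wins
    return list(seen.values())

rows = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "c"},  # duplicate id=1; this row should survive
]
print(keep_last(rows, "id"))
# [{'id': 1, 'v': 'c'}, {'id': 2, 'v': 'b'}]
```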

Does PySpark Union drop duplicates?

Note: in other SQL dialects, UNION eliminates duplicates while UNION ALL merges two datasets including duplicate records. In PySpark, however, both union() and unionAll() behave the same (neither removes duplicates), so it is recommended to use the DataFrame dropDuplicates() function to remove duplicate rows after a union.

How do I remove duplicates in spark?

Duplicate rows can be removed from a Spark SQL DataFrame using the distinct() and dropDuplicates() functions: distinct() removes rows that have the same values in all columns, whereas dropDuplicates() removes rows that have the same values in a selected subset of columns.
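The distinct-vs-dropDuplicates difference can be shown without Spark at all. Below is a plain-Python sketch that mirrors the two semantics on a list of dict "rows" (the column names are assumptions for the example); both keep the first occurrence of each duplicate group:

```python
def distinct(rows):
    """Unique rows, comparing all columns (like DataFrame.distinct())."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # whole row is the key
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def drop_duplicates(rows, subset):
    """Unique rows, comparing only `subset` columns (like dropDuplicates)."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in subset)  # only subset columns form the key
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"name": "a", "n": 1},
    {"name": "a", "n": 1},  # exact duplicate: dropped by both
    {"name": "a", "n": 2},  # same name, different n: kept by distinct only
]
print(distinct(rows))                    # two rows survive
print(drop_duplicates(rows, ["name"]))   # one row survives
```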


1 Answer

The subset argument to drop_duplicates / dropDuplicates should be a collection of column names whose Java equivalent can be converted to a Scala Seq, not a single string. You can use either a list:

df.drop_duplicates(subset=['id'])

or a tuple:

df.drop_duplicates(subset=('id', ))
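The traceback fits this: Py4J looks for a toSeq method accepting the argument it was handed, and a bare java.lang.String has no such overload, hence "Method toSeq([class java.lang.String]) does not exist". On the Python side there is a related pitfall worth noting: a string is itself an iterable of single characters, so even an API that naively iterates subset would see two "column names" rather than one. A quick illustration:

```python
# A bare string is an iterable of characters, which is why APIs expecting
# a collection of column names should be given a list or tuple instead.
subset = "id"
print(list(subset))    # ['i', 'd']  -- two "column names", not one
print(list(["id"]))    # ['id']      -- one column name, as intended
print(list(("id",)))   # ['id']      -- a one-element tuple also works
```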
answered Sep 22 '22 by zero323