Keep only duplicates from a DataFrame regarding some field

I have this Spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE|     2|null|    08:58:00|    23:29:00|
|TDR|  QWA|     3|null|    08:57:00|    23:28:00|
|ALT| TEST|     4|null|    08:56:00|    23:27:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

I want to get a new DataFrame containing only the rows that are not unique with respect to the three fields "ID", "ID2" and "Number".

It means that I want this DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

Or maybe a dataframe with all the duplicates:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

asked Mar 29 '18 15:03

Anneso

1 Answer

One way to do this is to use a pyspark.sql.Window to add a column that counts how many rows share each row's ("ID", "ID2", "Number") combination, then select only the rows where that count is greater than 1.

import pyspark.sql.functions as f
from pyspark.sql import Window

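# Partition by the three key columns: all rows in a partition share the same key.
w = Window.partitionBy('ID', 'ID2', 'Number')
# Attach the partition size to every row, keep rows from partitions with more
# than one row, then drop the helper column.
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()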
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     2|null|    08:53:00|    23:24:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#|ALT|QWA|     6|null|    08:55:00|    23:26:00|
#+---+---+------+----+------------+------------+

I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
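To make the window logic concrete, here is a minimal plain-Python sketch (no Spark involved) of what the count-over-partition step computes. The `rows` list and the `keep_duplicates` helper are hypothetical names introduced only for illustration; the data is copied from the question, with the Name and Closing_Hour columns left out for brevity.

```python
from collections import Counter

# Rows from the question as (ID, ID2, Number, Opening_Hour) tuples.
rows = [
    ("ALT", "QWA", 6, "08:59:00"),
    ("ALT", "AUTRE", 2, "08:58:00"),
    ("TDR", "QWA", 3, "08:57:00"),
    ("ALT", "TEST", 4, "08:56:00"),
    ("ALT", "QWA", 6, "08:55:00"),
    ("ALT", "QWA", 2, "08:54:00"),
    ("ALT", "QWA", 2, "08:53:00"),
]


def keep_duplicates(rows, key=lambda r: r[:3]):
    """Return the rows whose key occurs more than once (hypothetical helper)."""
    # Count rows per key; this plays the role of the dupeCount column.
    counts = Counter(key(r) for r in rows)
    # Keep a row only if its key group has more than one member.
    return [r for r in rows if counts[key(r)] > 1]


dupes = keep_duplicates(rows)
for row in dupes:
    print(row)
```

The sketch keeps the four ALT/QWA rows, matching the Spark output above. Unlike a Python list, Spark does not promise any particular row order in the result.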

If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could do so using another Window to order the rows.

For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per grouping.

w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#+---+---+------+----+------------+------------+
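Note that w2 orders by the same columns it partitions on, so every row within a group ties and which one receives row number 1 is arbitrary. If a specific row is wanted, the window could instead be ordered by another column, for example Window.partitionBy('ID', 'ID2', 'Number').orderBy('Opening_Hour'). The plain-Python sketch below (no Spark involved) illustrates that idea under the assumption that the earliest Opening_Hour should win; `first_per_duplicate_group` is a hypothetical helper, not part of any library.

```python
# Rows from the question as (ID, ID2, Number, Opening_Hour) tuples.
rows = [
    ("ALT", "QWA", 6, "08:59:00"),
    ("ALT", "AUTRE", 2, "08:58:00"),
    ("TDR", "QWA", 3, "08:57:00"),
    ("ALT", "TEST", 4, "08:56:00"),
    ("ALT", "QWA", 6, "08:55:00"),
    ("ALT", "QWA", 2, "08:54:00"),
    ("ALT", "QWA", 2, "08:53:00"),
]


def first_per_duplicate_group(rows, key=lambda r: r[:3], order=lambda r: r[3]):
    """Return one row per duplicated key: the smallest under `order`."""
    # Group rows by key, like Window.partitionBy.
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r)
    # Skip groups of size 1 (dupeCount > 1), then take the first row under
    # the chosen ordering (rowNum = 1).
    return [min(g, key=order) for g in groups.values() if len(g) > 1]


picked = first_per_duplicate_group(rows)
for row in picked:
    print(row)
```

With this ordering the sketch keeps the 08:55:00 row for Number 6 and the 08:53:00 row for Number 2. Reversing the ordering would keep the latest opening hour instead.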

answered Nov 13 '22 10:11

pault

Tags:

apache-spark

pyspark

spark-dataframe
