I have this spark DataFrame:
+---+-----+------+----+------------+------------+
| ID| ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT| QWA| 6|null| 08:59:00| 23:30:00|
|ALT|AUTRE| 2|null| 08:58:00| 23:29:00|
|TDR| QWA| 3|null| 08:57:00| 23:28:00|
|ALT| TEST| 4|null| 08:56:00| 23:27:00|
|ALT| QWA| 6|null| 08:55:00| 23:26:00|
|ALT| QWA| 2|null| 08:54:00| 23:25:00|
|ALT| QWA| 2|null| 08:53:00| 23:24:00|
+---+-----+------+----+------------+------------+
I want to get a new DataFrame with only the rows that are not unique with regard to the 3 fields "ID", "ID2" and "Number".
It means that I want this DataFrame:
+---+-----+------+----+------------+------------+
| ID| ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT| QWA| 6|null| 08:59:00| 23:30:00|
|ALT| QWA| 2|null| 08:53:00| 23:24:00|
+---+-----+------+----+------------+------------+
Or maybe a dataframe with all the duplicates:
+---+-----+------+----+------------+------------+
| ID| ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT| QWA| 6|null| 08:59:00| 23:30:00|
|ALT| QWA| 6|null| 08:55:00| 23:26:00|
|ALT| QWA| 2|null| 08:54:00| 23:25:00|
|ALT| QWA| 2|null| 08:53:00| 23:24:00|
+---+-----+------+----+------------+------------+
To keep only the duplicate rows in PySpark, group by the key columns with groupBy(), count the rows in each group with count(), and then keep only the groups whose count is greater than 1.
Use DataFrame.drop_duplicates() to drop duplicate rows and keep one row from each set of duplicates. Called without any arguments, DataFrame.drop_duplicates() drops rows that have the same values in all columns.
To handle duplicate values, we may use a strategy in which we keep one occurrence of each value and drop the rest. dropDuplicates(): the PySpark DataFrame provides a dropDuplicates() function that drops duplicate occurrences of data inside a DataFrame; drop_duplicates() is an alias for it.
One way to do this is by using a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number") combination. Then select only the rows where the duplicate count is greater than 1.
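import pyspark.sql.functions as f
from pyspark.sql import Window

# One window partition per ("ID", "ID2", "Number") combination
w = Window.partitionBy('ID', 'ID2', 'Number')

# Count the rows in each partition, keep those that occur more than once
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()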
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA| 2|null| 08:54:00| 23:25:00|
#|ALT|QWA| 2|null| 08:53:00| 23:24:00|
#|ALT|QWA| 6|null| 08:59:00| 23:30:00|
#|ALT|QWA| 6|null| 08:55:00| 23:26:00|
#+---+---+------+----+------------+------------+
I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
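If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could do so using another Window to order the rows.
For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per grouping. Note that w2 orders by the same columns it partitions by, so every row in a group ties and which one gets row number 1 is arbitrary; order by another column, such as "Opening_Hour", if you need a specific row.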
w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')

df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA| 2|null| 08:54:00| 23:25:00|
#|ALT|QWA| 6|null| 08:59:00| 23:30:00|
#+---+---+------+----+------------+------------+