I have a requirement where I need to count the number of duplicate rows in Spark SQL for Hive tables.
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row
app_name="test"
conf = SparkConf().setAppName(app_name)
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from DV_BDFRAWZPH_NOGBD_R000_SG.employee")
As of now I have hardcoded the table name, but it actually comes in as a parameter, so we don't know the number of columns or their names either. In Python pandas we have something like df.duplicated().sum() to count the number of duplicate records. Do we have something like this here?
+---+---+---+
| 1 | A | B |
+---+---+---+
| 1 | A | B |
+---+---+---+
| 2 | B | E |
+---+---+---+
| 2 | B | E |
+---+---+---+
| 3 | D | G |
+---+---+---+
| 4 | D | G |
+---+---+---+
Here the number of duplicate rows is 4 (for example).
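For reference, a rough pandas sketch of what I mean, built on the example table above (note that pandas needs keep=False to count all four rows the way I do):

import pandas as pd

# Toy frame mirroring the example table above
pdf = pd.DataFrame([(1, 'A', 'B'), (1, 'A', 'B'),
                    (2, 'B', 'E'), (2, 'B', 'E'),
                    (3, 'D', 'G'), (4, 'D', 'G')])

print(pdf.duplicated().sum())             # 2 -> only the extra copies
print(pdf.duplicated(keep=False).sum())   # 4 -> every row that has a duplicate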
In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get the distinct count: distinct() eliminates duplicate records (matching all columns of a Row) from the DataFrame, and count() returns the number of records in the DataFrame.
In pandas, to find duplicates on a specific column you can simply call the duplicated() method on that column; the result is a boolean Series in which True marks a duplicate.
You can also count the number of distinct rows over a set of columns and compare it with the total number of rows. If they are equal, there are no duplicate rows; if the distinct count is lower than the total, duplicates exist.
To find complete row duplicates, use groupBy() on all the columns (using df.columns) together with the count() aggregate. To find column-level duplicates, use groupBy() on just the required columns with count(), then filter to keep only the duplicate records. A sketch of both checks follows below.
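A minimal sketch of those two checks, assuming df is the DataFrame loaded from the Hive table above (the 'emp_id' column is a placeholder, since the real column names are not known):

import pyspark.sql.functions as f

# Full-row duplicates via distinct-count comparison
total_rows = df.count()
distinct_rows = df.distinct().count()       # exact full-row duplicates removed
extra_copies = total_rows - distinct_rows   # 6 - 4 = 2 for the example table above

# Column-level duplicates: group on the columns of interest,
# keep only the groups that occur more than once
key_cols = ['emp_id']                       # placeholder; any subset of df.columns works
df.groupBy(key_cols) \
    .count() \
    .where(f.col('count') > 1) \
    .show()

Note that total_rows - distinct_rows counts only the surplus copies (2 in the example), while the groupBy approach below counts every row that belongs to a duplicate group (4).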
You essentially want to groupBy() all the columns and count(), then select the sum of the counts for the rows where the count is greater than 1.
import pyspark.sql.functions as f

df.groupBy(df.columns) \
    .count() \
    .where(f.col('count') > 1) \
    .select(f.sum('count')) \
    .show()
Explanation
After the grouping and aggregation, your data will look like this:
+---+---+---+---+
| 1 | A | B | 2 |
+---+---+---+---+
| 2 | B | E | 2 |
+---+---+---+---+
| 3 | D | G | 1 |
+---+---+---+---+
| 4 | D | G | 1 |
+---+---+---+---+
Then use where() to filter only the rows with a count greater than 1, and select the sum. In this case, you will get the first 2 rows, which sum to 4.
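For completeness, a sketch of how this could be wrapped so the table name stays a parameter and the column names never need to be known (the function name, the .format() call and the null-handling fallback are just illustrative choices):

import pyspark.sql.functions as f

def count_duplicate_rows(sqlContext, table_name):
    # table_name is whatever the caller passes in; using df.columns means
    # neither the number of columns nor their names has to be known up front.
    df = sqlContext.sql("select * from {}".format(table_name))
    row = (df.groupBy(df.columns)
             .count()
             .where(f.col('count') > 1)
             .agg(f.sum('count').alias('dup_rows'))
             .collect()[0])
    return row['dup_rows'] or 0   # sum() comes back as null when there are no duplicates

# e.g. count_duplicate_rows(sqlContext, "DV_BDFRAWZPH_NOGBD_R000_SG.employee")
# returns 4 for the example data in the question.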