Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark SQL DataFrame consisting of an ID column and n "data" columns, i.e.

id | dat1 | dat2 | ... | datn

The id column is unique, whereas, looking at dat1 ... datn, there may be duplicates.

My goal is to find the ids of those duplicates.

My approach so far:

  • get the duplicate rows using groupBy:

    dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')

  • join the dup_df with the entire df to get the duplicate rows including id:

    df.join(dup_df, df.columns[1:])

I am quite certain that this approach is basically correct, but it fails because the dat1 ... datn columns contain null values.

To do the join on null values, I found e.g. this SO post. But this would require constructing a huge "string join condition".
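To illustrate, here is a rough sketch of the kind of condition I would have to assemble by hand (the a / b aliases are only for illustration; Column.eqNullSafe, added in later Spark versions, would shorten this, but it is not in 2.1's Python API):

from functools import reduce
from operator import and_
from pyspark.sql.functions import col

a = df.alias("a")
b = dup_df.alias("b")

# one null-safe equality check per data column, ANDed together
cond = reduce(and_, [
    (col("a." + c) == col("b." + c)) |
    (col("a." + c).isNull() & col("b." + c).isNull())
    for c in df.columns[1:]
])

ids_of_duplicates = a.join(b, cond).select("a.id")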

Thus my questions:

  1. Is there a simple / more generic / more pythonic way to do joins on null values?
  2. Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?

BTW: I am using Spark 2.1.0 and Python 3.5.3

asked Mar 29 '17 by akoeltringer

People also ask

How do I find duplicate rows in spark data frame?

Find complete-row duplicates: group by all columns with a count() aggregate, then filter for count > 1. Find column-level duplicates: group by the required columns with count() and filter in the same way to get the duplicate records.

How do you drop duplicate rows based on one column in PySpark?

PySpark's distinct() function is used to drop duplicate rows (considering all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or more selected columns.
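A minimal sketch of both calls, assuming a DataFrame df like the one described in the question (the column name "dat1" is just a placeholder):

# drop duplicates considering all columns
deduped_all = df.distinct()

# drop duplicates considering only selected column(s)
deduped_one = df.dropDuplicates(["dat1"])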

How can I see the number of duplicate rows?

You can count the number of duplicate rows by counting True in the pandas.Series returned by duplicated(); the number of True values can be counted with the sum() method. If you want to count the number of False values (i.e. the number of non-duplicate rows), invert it with the negation operator ~ and then count True with sum().
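A minimal pandas sketch of this (pdf is a small hypothetical pandas DataFrame, unrelated to the Spark one above):

import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# rows that repeat an earlier row
n_duplicates = pdf.duplicated().sum()

# non-duplicate rows (first occurrences), via negation
n_non_duplicates = (~pdf.duplicated()).sum()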

How do you count duplicate rows in PySpark?

In PySpark, you can use distinct().count() on a DataFrame or the countDistinct() SQL function to get the distinct count. distinct() eliminates duplicate records (matching all columns of a row) from the DataFrame, and count() returns the number of records in the DataFrame.
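For example, a rough sketch of counting how many rows are exact copies of another row, as the difference between the total and the distinct count (assuming a DataFrame df without a unique key column):

total_rows = df.count()
distinct_rows = df.distinct().count()

# rows that duplicate some other row (beyond the first occurrence)
num_duplicate_rows = total_rows - distinct_rows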


1 Answer

If the number of ids per group is relatively small, you can group by and use collect_list. Required imports:

from pyspark.sql.functions import collect_list, size

example data:

df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])  # only the first column is named; the rest keep default names _2, _3, _4

query:

(df
   .groupBy(df.columns[1:])
   .agg(collect_list("id").alias("ids"))
   .where(size("ids") > 1))

and the result:

+----+---+----+------+
|  _2| _3|  _4|   ids|
+----+---+----+------+
|null|  f|null|[2, 4]|
|   a|  b|   3|[1, 5]|
+----+---+----+------+

You can apply explode twice (or use a udf) to get an output equivalent to the one returned by the join.
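A minimal sketch of that flattening step, assuming the grouped result above is assigned to dup_groups; a single explode over the ids column already yields one row per duplicate id:

from pyspark.sql.functions import explode

# dup_groups is assumed to hold the collect_list result shown above
flat = dup_groups.withColumn("id", explode("ids")).drop("ids")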

You can also identify groups using the minimal id per group. A few additional imports:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min

window definition:

w = Window.partitionBy(df.columns[1:])

query:

(df
    .select(
        "*", 
        count("*").over(w).alias("_cnt"), 
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))

and the result:

+---+----+---+----+----+-----+
| id|  _2| _3|  _4|_cnt|group|
+---+----+---+----+----+-----+
|  2|null|  f|null|   2|    2|
|  4|null|  f|null|   2|    2|
|  1|   a|  b|   3|   2|    1|
|  5|   a|  b|   3|   2|    1|
+---+----+---+----+----+-----+

You can further use the group column for a self join.
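A rough sketch of that self join, assuming the result of the window query above is assigned to marked (it pairs every duplicate id with the other ids in its group):

from pyspark.sql.functions import col

# "marked" is assumed to hold the result of the window query above
pairs = (marked.alias("a")
    .join(marked.alias("b"), "group")
    .where(col("a.id") < col("b.id"))
    .select(col("a.id").alias("id"), col("b.id").alias("duplicate_id")))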

answered Oct 24 '22 by zero323