I need to find duplicate record (with master record id and duplicate record ids):
select ciid, name from (
select ciid, name, row_number() over (
partition by related_id, name order by updatedate desc) rn
) where rn = 1;
This gives me the master record IDs, but it also includes records without duplicates.
If I use
select ciid, name from (
select ciid, name, row_number() over (
partition by related_id, name order by updatedate desc) rn
) where rn > 1;
This gets me all the duplicate records, but not the master record.
I was wishing if I do something like:
select ciid, name from (
select ciid, name, row_number() over (
partition by related_id, name order by updatedate desc
) rn, count(*) over (
partition by related_id, name order by updatedate desc
) cnt
) where rn = 1 and cnt > 1;
But I was worried about the performance, or even is it actually doing what I want.
How do I get the master record only for the ones with duplicates? Please note that name
is not unique column. Only ciid
is unique.
SQL Count with Partition By clause is one of the new powerful syntax that t-sql developers can easily use. For example, while selecting a list of rows you can also calculate count of rows sharing the same field values without using subselects or SQL CTE (Common Table Expressions) in your query.
PARTITION BY It is an optional clause in the ROW_NUMBER function. It is a clause that divides the result set into partitions (groups of rows). The ROW_NUMBER() method is then applied to each partition, which assigns a separate rank number to each partition.
The ROW_NUMBER function cannot currently be used in a WHERE clause. Derby does not currently support ORDER BY in subqueries, so there is currently no way to guarantee the order of rows in the SELECT subquery.
The row_number() window function can be used without order by in over to arbitrarily assign a unique value to each row.
I ended up using similar query in my question:
select ciid, name from (
select ciid, name, row_number() over (
partition by related_id, name order by updatedate desc
) rn, count(*) over (
partition by related_id, name desc
) cnt
) where rn = 1 and cnt > 1;
Works surprisingly well. The master record is where rn = 1 and duplicates are where rn > 1. Make sure count(*) over (partition ..)
cannot have order by
clause.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With