
SQL to find duplicate entries (within a group)

Tags: sql, join, oracle

I have a small problem and I'm not sure what the best way to fix it is, as I only have limited access to the database (Oracle) itself. Our table "EVENT" has about 160k entries. Each EVENT has a GROUPID, and a normal group has exactly 5 rows with the same GROUPID. Due to a bug we currently get a couple of duplicated groups (10 rows instead of 5, identical apart from the EVENTID). The number of extra rows may change, so the condition is simply a row count <> 5. We need to find all the entries of these groups.

Due to limited access to the database we cannot use a temporary table, nor can we add an index on the GROUPID column to make it faster.

We can get the GROUPIDs with this query, but we would need a second query to get the actual rows:

select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5

One solution would be a subselect:

select *
from "EVENT" A
where A."GROUPID" IN (
  select B."GROUPID"
  from "EVENT" B
  group by B."GROUPID"
  having count(B."GROUPID") <> 5
)

Without an index on GROUPID and with 160k entries, this takes much too long. I tried to think of a join that can handle this, but haven't found a good solution so far.

Can anybody suggest a good solution for this?

Small edit: We don't have 100% duplicates here, as each entry still has a unique ID and the GROUPID is not unique either (that's why we need to use "group by") - or maybe I'm just missing an easy solution :)

A small example of the data (I don't want to delete it, just find it):

EVENTID | GROUPID | TYPEID
123456  | 123     | 12
123457  | 123     | 145
123458  | 123     | 2612
123459  | 123     | 41
123460  | 123     | 238

234567  | 123     | 12
234568  | 123     | 145
234569  | 123     | 2612
234570  | 123     | 41
234571  | 123     | 238

The table has some more columns, such as a timestamp, but as you can see, everything is identical apart from the EVENTID.

We will run the query repeatedly during testing, to find the bug and check whether it happens again.

FrankS asked Dec 01 '22 08:12


1 Answer

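This is a classic problem for analytic (window) functions, which let you attach the per-group row count to every row without a separate grouped subquery: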

select eventid,
       groupid,
       typeid
from   (
       -- attach the number of rows in each GROUPID to every row
       select eventid,
              groupid,
              typeid,
              count(*) over (partition by groupid) count_by_group_id
       from   EVENT
       )
where  count_by_group_id <> 5
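
To try the analytic query above without touching the Oracle database, here is a minimal sketch using Python's built-in sqlite3 module. Several things here are assumptions for illustration only: SQLite (version 3.25 or newer, which is needed for window functions) stands in for Oracle, the table is reduced to three columns, and group 456 is made-up data added as a normal 5-row group. The rows for group 123 are the ones from the question.

```python
import sqlite3

# Sample data: group 123 is the duplicated group from the question (10 rows).
# Group 456 is a hypothetical normal group (5 rows) added for contrast.
TYPE_IDS = [12, 145, 2612, 41, 238]
rows = [(123456 + i, 123, t) for i, t in enumerate(TYPE_IDS)]
rows += [(234567 + i, 123, t) for i, t in enumerate(TYPE_IDS)]
rows += [(345678 + i, 456, t) for i, t in enumerate(TYPE_IDS)]

# In-memory stand-in for the EVENT table (assumption: only three columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (eventid INTEGER PRIMARY KEY, groupid INTEGER, typeid INTEGER)"
)
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)

# Same shape as the analytic query: attach the per-group row count to
# every row, then keep only rows whose group does not have exactly 5 rows.
QUERY = """
SELECT eventid, groupid, typeid
FROM (
    SELECT eventid, groupid, typeid,
           COUNT(*) OVER (PARTITION BY groupid) AS count_by_group_id
    FROM event
)
WHERE count_by_group_id <> 5
ORDER BY eventid
"""

suspect_rows = conn.execute(QUERY).fetchall()
for row in suspect_rows:
    print(row)
```

With this sample data, only the ten rows of group 123 should come back, and the five rows of group 456 should be filtered out. Timings on SQLite say nothing about Oracle, so the query still needs to be measured against the real 160k-row table.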

David Aldridge answered Dec 26 '22 05:12