I'm trying to create a MySQL query that will return all individual rows (not grouped) containing duplicate values from within a group of related records. By 'groups of related records' I mean those with the same account number (per the sample below).
Basically, within each group of related records that share the same distinct account number, select just those rows whose values for the date
or amount
columns are the same as another row's values within that account's group of records. Values should only be considered duplicate from within that account's group. The sample table and ideal output details below should clear things up.
Also, I'm not concerned with any records with a status of X being returned, even if they have duplicate values.
id account invoice date amount status
1 1 1 2012-04-01 0 X
2 1 2 2012-04-01 120 P
3 1 2 2012-05-01 120 U
4 1 3 2012-05-01 117 U
5 2 4 2012-04-01 82 X
6 2 4 2012-05-01 82 U
7 2 5 2012-03-01 81 P
8 2 6 2012-05-01 80 U
9 3 7 2012-03-01 80 P
10 3 8 2012-04-01 79 U
11 3 9 2012-04-01 78 U
id account invoice date amount status
2 1 2 2012-04-01 120 P
3 1 2 2012-05-01 120 U
4 1 3 2012-05-01 117 U
6 2 4 2012-05-01 82 U
8 2 6 2012-05-01 80 U
10 3 8 2012-04-01 79 U
11 3 9 2012-04-01 78 U
Thus, row 7/9 and 8/9 should not both be returned because their duplicate values are not considered duplicate from within the scope of their respective accounts. However, row 8 should be returned because it shares a duplicate value with row 6.
Later, I may want to further hone the selection by grabbing only duplicate rows that have matching statuses, thus row 2 would be excluded because it does't match the other two found within that account's group of records. How much more difficult would that make the query? Would it just be a matter of adding a WHERE or HAVING clause, or is it more complicated than that?
I hope my explanation of what I'm trying to accomplish makes sense. I've tried using INNER JOIN but that returns each desired row more than once. I don't want duplicates of duplicates.
CREATE TABLE payment (
id int(11) NOT NULL auto_increment,
account int(10) NOT NULL default '0',
invoice int(10) NOT NULL default '0',
date date NOT NULL default '0000-00-00',
amount int(10) NOT NULL default '0',
status char(1) NOT NULL default '',
PRIMARY KEY (id)
);
INSERT INTO payment VALUES (1, 1, 1, '2012-04-01', 0, 'X');
INSERT INTO payment VALUES (2, 1, 2, '2012-04-01', 120, 'P');
INSERT INTO payment VALUES (3, 1, 2, '2012-05-01', 120, 'U');
INSERT INTO payment VALUES (4, 1, 3, '2012-05-01', 117, 'U');
INSERT INTO payment VALUES (5, 2, 4, '2012-04-01', 82, 'X');
INSERT INTO payment VALUES (6, 2, 4, '2012-05-01', 82, 'U');
INSERT INTO payment VALUES (7, 2, 5, '2012-03-01', 81, 'p');
INSERT INTO payment VALUES (8, 2, 6, '2012-05-01', 80, 'U');
INSERT INTO payment VALUES (9, 3, 7, '2012-03-01', 80, 'U');
INSERT INTO payment VALUES (10, 3, 8, '2012-04-01', 79, 'U');
INSERT INTO payment VALUES (11, 3, 9, '2012-04-01', 78, 'U');
This type of query can be implemented as a semi join.
Semijoins are used to select rows from one of the tables in the join.
For example:
select distinct l.*
from payment l
inner join payment r
on
l.id != r.id and l.account = r.account and
(l.date = r.date or l.amount = r.amount)
where l.status != 'X' and r.status != 'X'
order by l.id asc;
Note the use of distinct
, and that I'm only selecting columns from the left table. This ensures that there are no duplicates.
The join condition checks that:
l.id != r.id
)l.account = r.account
)l.date = r.date or l.amount = r.amount
)For the second part of your question, you would need to update the on
clause in the query.
This seems to work
select * from payment p1
join payment p2 on
(p1.id != p2.id
and p1.status != 'X'
and p1.account = p2.account
and (p1.amount = p2.amount or p1.date = p2.date))
group by p1.id
http://sqlfiddle.com/#!2/a50e9/3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With