Where I am working I have been recently told that using distinct in your queries is a bad sign of a programmer. So I am wondering I guess the only way to not use this function is to use a group by .
It was my understanding that the distinct function works very similarly to a group by except in how its read. A distinct function checks each individual selection criteria vs a group by which does the same thing only done as a whole.
Keep in mind I only do reporting . I do not create/alter the data. So my question is for best practices should I be using distinct or group by. If neither then is there an alternative. Maybe the group by should be used in more complex queries than my non-real example here, but you get the idea. I could not find an answer that really explained why or why not I should use distinct in my queries
select distinct
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
VS
select
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
group by "ID","last","first","MI","Street","Street2","city","State","zip"
Is SQL DISTINCT good (or bad) when you need to remove duplicates in results? Some say it's good and add DISTINCT when duplicates appear. Some say it's bad and suggest using GROUP BY without an aggregate function. Others say DISTINCT and GROUP BY are the same when you need to remove duplicates.
As a general rule, SELECT DISTINCT incurs a fair amount of overhead for the query. Hence, you should avoid it or use it sparingly. The idea of generating duplicate rows using JOIN just to remove them with SELECT DISTINCT is rather reminiscent of Sisyphus pushing a rock up a hill, only to have it roll back down again.
SQL DISTINCT clause is used to remove the duplicates columns from the result set. The distinct keyword is used with select keyword in conjunction. It is helpful when we avoid duplicate values present in the specific columns/tables. The unique values are fetched when we use the distinct keyword.
Yes, basically it has to sort the results and then re-processed to eliminate the duplicates. This cull could also be being done during the sort, but we can only speculate as to how exactly the code works in the background. You could try and improve the performance by creating an index composed of all three (3) fields.
Databases are smart to recognize what you mean. I expect both of your queries to perform equally well. It is important for someone else maintaining your query to know what you meant. If you really meant to retrieve distinct records, use DISTINCT
. If your intention was to do aggregation, use GROUP BY
Take a look at this question. There are some nice answers that might help.
The answer provided by @zedfoxus is useful to understand the context.
However, I don't believe your query should require distinct records if the data is designed correctly.
It appears you are selecting the primary key of table spriden
, so all that data should be unique. You're also joining onto the spraddr
table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterium that's required to filter out those duplicates?
This is why I get nervous about use of "distinct
" - the spraddr
table may include additional columns which you should use to filter out data, and "distinct
" may be hiding that.
Also, you may be generating a massive result set which needs to be filtered by the "distinct" clause, which can cause performance issues. For instance, if there are 1 million rows in spraddr
for each row in spriden
, and you should use the "is_current" flag to find the 2 or 3 "real" ones.
Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With