Should I use distinct in my queries

Where I work, I was recently told that using distinct in your queries is a bad sign in a programmer. So I am wondering: I guess the only way to avoid it is to use a group by.

It was my understanding that distinct works very similarly to a group by, except in how it is read. A distinct checks each individual selection criterion, whereas a group by does the same thing, only as a whole.

Keep in mind I only do reporting. I do not create or alter the data. So my question is: for best practice, should I be using distinct or group by? If neither, is there an alternative? Maybe group by should be used in more complex queries than my non-real example here, but you get the idea. I could not find an answer that really explained why I should or should not use distinct in my queries.

select distinct
    spriden_user_id as "ID",
    spriden_last_name as "last",
    spriden_first_name as "first",
    spriden_mi_name as "MI",
    spraddr_street_line1 as "Street",
    spraddr_street_line2 as "Street2",
    spraddr_city as "city",
    spraddr_stat_code as "State",
    spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'

VS

select
    spriden_user_id as "ID",
    spriden_last_name as "last",
    spriden_first_name as "first",
    spriden_mi_name as "MI",
    spraddr_street_line1 as "Street",
    spraddr_street_line2 as "Street2",
    spraddr_city as "city",
    spraddr_stat_code as "State",
    spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
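-- group by the underlying columns; grouping by select-list aliases is not portable
group by spriden_user_id, spriden_last_name, spriden_first_name, spriden_mi_name,
    spraddr_street_line1, spraddr_street_line2, spraddr_city, spraddr_stat_code, spraddr_zip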

Taku_ asked Nov 11 '15 13:11



2 Answers

Databases are smart enough to recognize what you mean. I expect both of your queries to perform equally well. What matters is that someone else maintaining your query can tell what you meant. If you really meant to retrieve distinct records, use DISTINCT. If your intention was to do aggregation, use GROUP BY.
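
As a rough illustration of that difference in intent, here is a sketch against the spraddr table from the question. The count(*) column and its address_count alias are assumptions added only for the example:

```sql
-- Intent: "give me each city/state pair once"
select distinct spraddr_city, spraddr_stat_code
from spraddr
where spraddr_mail_type = 'MA';

-- Intent: "aggregate per city/state" (here, count the addresses in each)
select spraddr_city, spraddr_stat_code, count(*) as address_count
from spraddr
where spraddr_mail_type = 'MA'
group by spraddr_city, spraddr_stat_code;
```

Both return one row per city/state pair, but the second tells the reader that something is being computed per group.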

Take a look at this question. There are some nice answers that might help.

zedfoxus answered Oct 14 '22 19:10


The answer provided by @zedfoxus is useful to understand the context.

However, I don't believe your query should require distinct records if the data is designed correctly.

It appears you are selecting the primary key of table spriden, so all that data should be unique. You're also joining onto the spraddr table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterion that's required to filter out those duplicates?

This is why I get nervous about use of "distinct" - the spraddr table may include additional columns which you should use to filter out data, and "distinct" may be hiding that.
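
One way to check where the duplicates come from, sketched under the assumption that spraddr_id identifies the person as in the question's join (the ma_rows alias is made up for the example):

```sql
-- List IDs that have more than one 'MA' address row.
-- If this returns no rows, spraddr is probably not the source of the duplicates.
select spraddr_id, count(*) as ma_rows
from spraddr
where spraddr_mail_type = 'MA'
group by spraddr_id
having count(*) > 1;
```

If it does return rows, looking at the full spraddr rows for one of those IDs usually shows which extra column tells them apart.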

Also, you may be generating a massive result set which then has to be reduced by the "distinct" clause, which can cause performance issues. For instance, there might be 1 million rows in spraddr for each row in spriden, when you should be using an "is_current" flag to find the 2 or 3 "real" ones.
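
A minimal sketch of that idea, rewritten with an explicit join. Note that is_current is a hypothetical column and 'Y' a hypothetical value; substitute whatever your spraddr table actually uses to mark the live address row:

```sql
select
    spriden_user_id as "ID",
    spriden_last_name as "last",
    spriden_first_name as "first",
    spraddr_street_line1 as "Street",
    spraddr_city as "city",
    spraddr_stat_code as "State",
    spraddr_zip as "zip"
from spriden
join spraddr on spraddr_id = spriden_user_id
where spraddr_mail_type = 'MA'
  and spraddr.is_current = 'Y';  -- hypothetical flag, not a known column
```

If the filter leaves one address per person, the query no longer needs distinct or group by at all.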

Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.

Neville Kuyt answered Oct 14 '22 21:10