I made a bit of a fool out of myself earlier today on this question. The question was using SQL Server, and the correct answer involved adding a HAVING
clause. The initial mistake I made was to think that an alias in the SELECT
statement could be used in the HAVING
clause, which is not allowed in SQL Server. I made this error because I assumed that SQL Server had the same rules as MySQL, which does allow an alias to be used in the HAVING
clause.
This got me curious, and I poked around on Stack Overflow and elsewhere, finding a bunch of material explaining why these rules are enforced on the two respective RDBMS. But nowhere did I find an explanation of what the performance implications would be of allowing/disallowing an alias in the HAVING
clause.
To give a concrete example, I will duplicate the query which occurred in the above-mentioned question:
SELECT students.camID, campus.camName, COUNT(students.stuID) as studentCount
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING COUNT(students.stuID) > 3
ORDER BY studentCount
What would be the performance implications of using an alias in the HAVING
clause instead of re specifying the COUNT
? This question can answered directly in MySQL, and hopefully someone could give insight into what would happen in SQL if it were to support the alias in the HAVING
clause.
This is a rare instance where it might be OK to tag a SQL question with both MySQL and SQL Server, so enjoy this moment in the sun.
A column alias cannot be used in a SELECT, WHERE, GROUP BY or HAVING clause due to the order of execution. You must refer to the original column name.
SQL aliases are used to give a table, or a column in a table, a temporary name. Aliases are often used to make column names more readable. An alias only exists for the duration of that query. An alias is created with the AS keyword.
column_alias can be used in an ORDER BY clause, but it cannot be used in a WHERE, GROUP BY, or HAVING clause. Standard SQL disallows references to column aliases in a WHERE clause. This restriction is imposed because when the WHERE clause is evaluated, the column value may not yet have been determined.
Benefits of SQL AliasesAllows you to provide more readable names to the column headers when they're presented in the results. Allows client applications to refer to a calculated field by name where no column name exists. Allows you to reduce code and make your queries more concise.
Narrowly focused on just that particular query, and with sample data loaded below. This does address some other queries such as the count(distinct ...)
mentioned by others.
The alias in the HAVING
appears to either slightly outperform or quite a bit outperform its alternative (depending on the query).
This uses a pre-existing table with about 5 million rows in it created quickly via this answer of mine which takes 3 to 5 minutes.
Resulting structure:
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
But using INNODB instead. Creates the expected INNODB gap anomaly due to the range reservation inserts. Just saying, but makes no difference. 4.7 Million rows.
Modify the table to get near Tim's assumed schema.
rename table ratings to students; -- not exactly instanteous (a COPY)
alter table students add column camId int; -- get it near Tim's schema
-- don't add the `camId` index yet
The following will take a while. Run it again and again in chunks or else your connection may timeout. The timeout is due to 5 million rows without a LIMIT clause in the update statement. Note, we do have a LIMIT clause.
So we are doing it in half a million row iterations. Sets a column to a random number between 1 and 20
update students set camId=floor(rand()*20+1) where camId is null limit 500000; -- well that took a while (no surprise)
Keep running the above until no camId
is null.
I ran it like 10 times (the whole thing takes 7 to 10 minutes)
select camId,count(*) from students
group by camId order by 1 ;
1 235641
2 236060
3 236249
4 235736
5 236333
6 235540
7 235870
8 236815
9 235950
10 235594
11 236504
12 236483
13 235656
14 236264
15 236050
16 236176
17 236097
18 235239
19 235556
20 234779
select count(*) from students;
-- 4.7 Million rows
Create a useful index (after the inserts of course).
create index `ix_stu_cam` on students(camId); -- takes 45 seconds
ANALYZE TABLE students; -- update the stats: http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html
-- the above is fine, takes 1 second
Create the campus table.
create table campus
( camID int auto_increment primary key,
camName varchar(100) not null
);
insert campus(camName) values
('one'),('2'),('3'),('4'),('5'),
('6'),('7'),('8'),('9'),('ten'),
('etc'),('etc'),('etc'),('etc'),('etc'),
('etc'),('etc'),('etc'),('etc'),('twenty');
-- ok 20 of them
Run the two queries:
SELECT students.camID, campus.camName, COUNT(students.id) as studentCount
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING COUNT(students.id) > 3
ORDER BY studentCount;
-- run it many many times, back to back, 5.50 seconds, 20 rows of output
and
SELECT students.camID, campus.camName, COUNT(students.id) as studentCount
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING studentCount > 3
ORDER BY studentCount;
-- run it many many times, back to back, 5.50 seconds, 20 rows of output
So the times are identical. Ran each a dozen times.
The EXPLAIN
output is the same for both
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
| 1 | SIMPLE | campus | ALL | PRIMARY | NULL | NULL | NULL | 20 | Using temporary; Using filesort |
| 1 | SIMPLE | students | ref | ix_stu_cam | ix_stu_cam | 5 | bigtest.campus.camID | 123766 | Using index |
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
Using the AVG() function, I am getting about a 12% increase in performance with the alias in the having
(with identical EXPLAIN
output) from the following two queries.
SELECT students.camID, campus.camName, avg(students.id) as studentAvg
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING avg(students.id) > 2200000
ORDER BY students.camID;
-- avg time 7.5
explain
SELECT students.camID, campus.camName, avg(students.id) as studentAvg
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID, campus.camName
HAVING studentAvg > 2200000
ORDER BY students.camID;
-- avg time 6.5
And lastly, the DISTINCT
:
SELECT students.camID, count(distinct students.id) as studentDistinct
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID
HAVING count(distinct students.id) > 1000000
ORDER BY students.camID; -- 10.6 10.84 12.1 11.49 10.1 9.97 10.27 11.53 9.84 9.98
-- 9.9
SELECT students.camID, count(distinct students.id) as studentDistinct
FROM students
JOIN campus
ON campus.camID = students.camID
GROUP BY students.camID
HAVING studentDistinct > 1000000
ORDER BY students.camID; -- 6.81 6.55 6.75 6.31 7.11 6.36 6.55
-- 6.45
The alias in the having consistently runs 35% faster with the same EXPLAIN
output. Seen below. So the same Explain output has been shown twice to not result in the same performance, but as a general clue.
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | campus | index | PRIMARY | PRIMARY | 4 | NULL | 20 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | students | ref | ix_stu_cam | ix_stu_cam | 5 | bigtest.campus.camID | 123766 | Using index |
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
The Optimizer Appears to favor the alias in the having at the moment, especially for the DISTINCT.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With