sum() vs. count()

Tags: sql, postgresql, aggregate-functions

Consider a voting system implemented in PostgreSQL, where each user can vote up or down on a "foo". There is a foo table that stores all the "foo information", and a votes table that stores the user_id, foo_id, and vote, where vote is +1 or -1.

To get the vote tally for each foo, the following query would work:

SELECT sum(vote) FROM votes WHERE foo.foo_id = votes.foo_id;

But, the following would work just as well:

(SELECT count(vote) FROM votes 
 WHERE foo.foo_id = votes.foo_id 
 AND votes.vote = 1)
- (SELECT count(vote) FROM votes 
   WHERE foo.foo_id = votes.foo_id 
   AND votes.vote = (-1))

I currently have an index on votes.foo_id.
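
For anyone who wants to reproduce the setup, a minimal sketch of the schema described above might look like this. The column types, the `info` column, the primary key on `votes`, and the index name are assumptions for illustration. Only the column names, the +1 / -1 values, the NOT NULL constraint on vote, and the index on votes.foo_id come from the question.

```sql
-- Hypothetical schema: types and constraint details are assumptions.
CREATE TABLE foo (
   foo_id serial PRIMARY KEY,
   info   text                      -- placeholder for the "foo information"
);

CREATE TABLE votes (
   user_id integer  NOT NULL,
   foo_id  integer  NOT NULL REFERENCES foo,
   vote    smallint NOT NULL CHECK (vote IN (1, -1)),
   PRIMARY KEY (user_id, foo_id)    -- assumption: one vote per user and foo
);

-- The index mentioned in the question (name is made up).
CREATE INDEX votes_foo_id_idx ON votes (foo_id);
```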

Which is a more efficient approach? (In other words, which would run faster?) I'm interested in both the PostgreSQL-specific answer and the general SQL answer.

EDIT

A lot of answers have been taking into account the case where vote is null. I forgot to mention that there is a NOT NULL constraint on the vote column.

Also, many have been pointing out that the first is much easier to read. Yes, it is definitely true, and if a colleague wrote the second one, I would be exploding with rage unless there was a performance necessity. Nevertheless, the question is still about the performance of the two. (Technically, if the first query were way slower, it wouldn't be such a crime to write the second query.)

asked Feb 21 '13 09:02

ryanrhee

1 Answer

Of course, the first example is faster, simpler and easier to read. That should be obvious even before one gets slapped with aquatic creatures. While sum() is slightly more expensive than count(), what matters much, much more is that the second example needs two scans.
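
If the counting formulation were wanted anyway, the second scan could be avoided with conditional aggregation in a single pass. A sketch, assuming Postgres 9.4 or later (for the aggregate FILTER clause) and written as a standalone grouped query rather than the correlated fragments from the question:

```sql
-- One pass over votes: count upvotes and downvotes separately, then subtract.
SELECT foo_id
     , count(*) FILTER (WHERE vote = 1)
     - count(*) FILTER (WHERE vote = -1) AS score
FROM   votes
GROUP  BY foo_id;
```

In SQL dialects without FILTER, `sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END)` expresses the same idea. Neither is likely to beat a plain sum(vote) here.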

But there is an actual difference, too: sum() can return NULL where count() doesn't. I quote the manual on aggregate functions:

It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect,
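
If a foo without any votes should show 0 instead of NULL, COALESCE can substitute the default. A minimal sketch, assuming the tally is computed in the SELECT list of a query on foo (the aliases f, v and score are just for illustration):

```sql
-- sum() over zero rows yields NULL; COALESCE turns that into 0.
SELECT f.*
     , COALESCE((SELECT sum(v.vote)
                 FROM   votes v
                 WHERE  v.foo_id = f.foo_id), 0) AS score
FROM   foo f;
```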

Since you seem to have a weak spot for performance optimization, here's a detail you might like: count(*) is slightly faster than count(vote). The two are only equivalent if vote is NOT NULL, which is the case here. Test performance with EXPLAIN ANALYZE.
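
A sketch of such a test, with both variants wrapped in an outer SELECT on foo so that they can run on their own (the aliases are assumptions). Note that EXPLAIN ANALYZE actually executes the query, and timings depend on data volume and caching, so run each a few times:

```sql
-- Variant 1: a single sum() per foo
EXPLAIN ANALYZE
SELECT f.foo_id
     , (SELECT sum(v.vote) FROM votes v WHERE v.foo_id = f.foo_id) AS score
FROM   foo f;

-- Variant 2: two counts per foo, using count(*) instead of count(vote)
EXPLAIN ANALYZE
SELECT f.foo_id
     , (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = 1)
     - (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = -1) AS score
FROM   foo f;
```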

On closer inspection

Both queries are syntactical nonsense, standing alone. They only make sense if you copied them from the SELECT list of a bigger query like:

SELECT *, (SELECT sum(vote) FROM votes WHERE votes.foo_id = foo.foo_id)
FROM   foo;

The important point here is the correlated subquery, which may be fine if you are only reading a small fraction of votes in your query. In that case we would expect to see additional WHERE conditions, and you should have matching indexes.
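
To illustrate, a sketch of a selective query with a matching index. The BETWEEN filter and the index name are made up. The multicolumn index on (foo_id, vote) is an assumption that goes beyond the existing index on votes.foo_id: it may allow index-only scans (Postgres 9.2 or later, and only if the table is vacuumed often enough), which is worth verifying with EXPLAIN ANALYZE:

```sql
-- Assumption: covers both the lookup column and the aggregated column.
CREATE INDEX votes_foo_id_vote_idx ON votes (foo_id, vote);

SELECT f.*
     , (SELECT sum(v.vote) FROM votes v WHERE v.foo_id = f.foo_id) AS score
FROM   foo f
WHERE  f.foo_id BETWEEN 1 AND 100;  -- hypothetical filter selecting few rows
```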

In Postgres 9.3 or later, a cleaner, 100 % equivalent alternative is LEFT JOIN LATERAL ... ON true:

SELECT *
FROM   foo f
LEFT   JOIN LATERAL (
   SELECT sum(vote) FROM votes WHERE foo_id = f.foo_id
   ) v ON true;

Typically similar performance. Details:

What is the difference between LATERAL and a subquery in PostgreSQL?

However, when reading large parts or all of table votes, this will be (much) faster:

SELECT f.*, v.score
FROM   foo f
JOIN   (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY 1
   ) v USING (foo_id);

Aggregate values in a subquery first, then join to the result.
About USING:

Remove duplicate column after SQL query
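
One caveat on the aggregate-then-join query: the plain JOIN drops every foo that has no votes, whereas the correlated subquery and the LATERAL variant keep those rows with a NULL score. A sketch of a variant that keeps them, assuming rows without votes should appear with a score of 0:

```sql
-- LEFT JOIN keeps foos without votes; COALESCE shows 0 instead of NULL.
SELECT f.*, COALESCE(v.score, 0) AS score
FROM   foo f
LEFT   JOIN (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY foo_id
   ) v USING (foo_id);
```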

answered Oct 04 '22 00:10

Erwin Brandstetter
