
PostgreSQL: count() or keep a counter?

Tags: sql, postgresql

I have two tables in a one-to-many relationship. Let's say that for each row in table foo, there can be 0 or more rows in table bar that reference the row in foo.

The client wants to know how many rows in bar reference a row in foo, for all the rows in foo.

I can accomplish this with the following query:

SELECT foo.foo_id, count(bar.bar_id) AS bar_count
FROM foo
LEFT JOIN bar ON bar.foo_id = foo.foo_id
GROUP BY foo.foo_id;

However, what if the tables foo and bar were large? Say foo has 1 million rows, and bar has 10 million rows. Let's also say that 99% of rows in foo would have a count of less than 1,000 bar rows referencing it. Let's say that the client typically asks for around 100 rows of foo at a time.

Should I use the naive count() query with an index on the foreign key, or would it be better to keep a counter? Is it even possible to keep a counter? By updating the counter with atomic increments and decrements using a trigger on bar, I believe it's possible, but I could be wrong.
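For reference, the trigger-maintained counter described in the question could be sketched roughly like this (the `bar_count` column and the function and trigger names are illustrative, not from the original; an UPDATE that changes `bar.foo_id` would also need handling):

```sql
-- Illustrative sketch: a counter column on foo, maintained by a trigger on bar.
ALTER TABLE foo ADD COLUMN bar_count bigint NOT NULL DEFAULT 0;

CREATE OR REPLACE FUNCTION maintain_bar_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE foo SET bar_count = bar_count + 1 WHERE foo_id = NEW.foo_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE foo SET bar_count = bar_count - 1 WHERE foo_id = OLD.foo_id;
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bar_count_trigger
AFTER INSERT OR DELETE ON bar
FOR EACH ROW EXECUTE FUNCTION maintain_bar_count();
```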

asked Jan 14 '23 by ryanrhee

2 Answers

Perhaps counter-intuitively, you'll probably find that the simple count approach is faster unless your workload is very biased towards reads.

The reason is that the counter table serializes updates: only one transaction that's updating a given foo can be in flight at any given time. The UPDATE issued by the counter-maintenance trigger takes a row lock on that foo's entry in the counter table and holds it until the transaction commits or rolls back.

Worse, if your transaction affects more than one foo and so does another one, you have a high chance of one of the transactions being aborted due to a deadlock.
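The deadlock scenario can be illustrated with two concurrent sessions whose triggers lock the same two counter rows in opposite order (the ids are hypothetical):

```sql
-- Session A                              -- Session B
BEGIN;                                    BEGIN;
INSERT INTO bar (foo_id) VALUES (1);      INSERT INTO bar (foo_id) VALUES (2);
-- trigger locks counter row for foo 1    -- trigger locks counter row for foo 2
INSERT INTO bar (foo_id) VALUES (2);      INSERT INTO bar (foo_id) VALUES (1);
-- blocks waiting on B's lock ...         -- blocks waiting on A's lock:
                                          -- deadlock; one transaction is aborted
```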

Stick to a simple count until you have a good reason to change it.

answered Jan 17 '23 by Craig Ringer


The sweet thing about indices is that they offer logarithmic complexity for lookups. Thus, for 10*10^6 rows, the index needs only about ln(10*10^6) ≈ 16.1 comparisons to find one specific id. Make it 100 million rows, and you only have to do 2 to 3 comparisons more. In short: the index hardly cares about the size of the table.
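Under this reasoning, the indexed approach amounts to little more than the following (the index name and the paging predicate are illustrative):

```sql
-- B-tree index on the foreign key; lookups scale with the log of bar's size.
CREATE INDEX bar_foo_id_idx ON bar (foo_id);

-- Counts for a batch of ~100 foo rows, answered via the index:
SELECT foo.foo_id, count(bar.bar_id) AS bar_count
FROM foo
LEFT JOIN bar ON bar.foo_id = foo.foo_id
WHERE foo.foo_id BETWEEN 1 AND 100   -- whatever selects the client's page
GROUP BY foo.foo_id;
```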

Of course, you may still be able to achieve some performance gains using a stored counter. That is a typical tradeoff: maintaining the counter makes insertions into and deletions from bar considerably more expensive, while making your counting query a little cheaper.

Thus, if your tables are altered rarely and the query runs frequently (say, thousands of times an hour), you might really gain performance with the stored counter. In most cases, though, I would say go for the indexed column and let the database do the rest for you.

answered Jan 17 '23 by Thilo