 

How is using a LIMIT clause different from simply fetching N results?

In my application, I have a table of responses to a topic. The structure is roughly as follows:

CREATE TABLE responses (
    id INT NOT NULL PRIMARY KEY,
    topic_id INT NOT NULL,
    author_id INT NOT NULL,
    response TEXT
);

id is an auto-increment field; topic_id and author_id are foreign keys, and the appropriate indexes are in place.

I always want to order by insertion time, usually by most recent. In most cases, I will be filtering by topic_id. A typical query looks like this:

SELECT * FROM responses WHERE topic_id=123 ORDER BY id DESC LIMIT 20;
-- or, for pagination:
SELECT * FROM responses WHERE topic_id=123 AND id < 456789 ORDER BY id DESC LIMIT 20;

I want to implement a blocklist: each user has a list of author_ids they don't want to see. I need to retrieve the top 20 results, excluding those author_ids and responses that reply to them.

Determining whether a row should be excluded is fairly complex, and while it would probably be possible to do this in the database (either in PL/pgSQL or by preprocessing), I want to keep the logic within the application. So I can do one of two things:

  1. Forget the LIMIT clause, leaving the query unbounded. Eat rows until I count 20 valid results, then close the query.
  2. Apply chunking: specify LIMIT 40 and hope that's enough for 20 "good" results; if not, fetch the next 40, and so on (see the sketch after this list).
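
Here's a minimal sketch of option 2, assuming psycopg2 and an application-side is_visible() predicate (the function names and chunk handling are my assumptions, not part of the question):

import psycopg2

# conn = psycopg2.connect(dsn)

def fetch_visible(conn, topic_id, is_visible, want=20, chunk=40):
    # Collect `want` rows that pass the application-side filter,
    # paging down by id in chunks of `chunk`.
    results = []
    last_id = None  # keyset cursor: lowest id seen so far
    while len(results) < want:
        with conn.cursor() as cur:
            if last_id is None:
                cur.execute(
                    "SELECT * FROM responses WHERE topic_id = %s"
                    " ORDER BY id DESC LIMIT %s",
                    (topic_id, chunk))
            else:
                cur.execute(
                    "SELECT * FROM responses WHERE topic_id = %s AND id < %s"
                    " ORDER BY id DESC LIMIT %s",
                    (topic_id, last_id, chunk))
            rows = cur.fetchall()
        if not rows:
            break  # table exhausted before the page filled up
        last_id = rows[-1][0]  # id is the first column of the table
        results.extend(r for r in rows if is_visible(r))
    return results[:want]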

What is the practical difference between the two, especially in terms of performance with many simultaneous users?

I'm doing this in PostgreSQL, but I'm willing to switch to a different RDBMS. (I don't want to lose referential integrity, so I'm not looking at NoSQL solutions.) Perhaps I'd have to tune some database parameters (such as prefetch sizes) to make the most of the unbounded-query case?
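
On that last point: in PostgreSQL, the prefetch size is usually a client-side setting. A named (server-side) cursor streams rows in batches instead of materializing the whole result set, and closing it early abandons the remainder. Here's a sketch of option 1 on that basis, again assuming psycopg2 (itersize is psycopg2's batch size, 2000 by default):

def first_visible(conn, topic_id, is_visible, want=20):
    results = []
    # Giving the cursor a name makes psycopg2 declare a PostgreSQL
    # server-side cursor; iterating then fetches `itersize` rows per
    # round trip instead of pulling the whole result set at once.
    with conn.cursor(name="responses_stream") as cur:
        cur.itersize = 100  # rows per fetch; tune to taste
        cur.execute(
            "SELECT * FROM responses WHERE topic_id = %s"
            " ORDER BY id DESC",
            (topic_id,))
        for row in cur:
            if is_visible(row):
                results.append(row)
                if len(results) == want:
                    break  # leaving the block closes the cursor
    return results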

asked Oct 07 '22 by matejcik

2 Answers

I can't speak to the specifics of Postgres, but it's possible that the query optimiser will use the LIMIT clause as part of the costing of the various candidate execution plans.

If you ...

select ... from ... where ... limit n

then the optimiser knows that you are only going to retrieve n rows, but for ...

select ... from ... where ... 

the optimiser might assume that you want the entire result set, which may be estimated at several thousand rows.

In particular, I'd expect an RDBMS to favour index-based access methods where LIMIT clauses are applied.
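
If you want to check this on your own data, EXPLAIN shows whether the LIMIT changes the plan PostgreSQL picks. A small helper along those lines (my sketch, assuming psycopg2; plan shapes will vary with your data and statistics):

def show_plan(conn, sql, params):
    # Print the plan the planner chooses for the given query.
    with conn.cursor() as cur:
        cur.execute("EXPLAIN " + sql, params)
        for (line,) in cur.fetchall():
            print(line)

# Compare the bounded and unbounded variants:
# show_plan(conn, "SELECT * FROM responses WHERE topic_id = %s"
#                 " ORDER BY id DESC LIMIT 20", (123,))
# show_plan(conn, "SELECT * FROM responses WHERE topic_id = %s"
#                 " ORDER BY id DESC", (123,))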

answered Oct 10 '22 by David Aldridge


Adding a blocklist in SQL isn't difficult.

SELECT * FROM responses 
WHERE topic_id=123 
    AND author_id NOT IN (SELECT author_id FROM blocked WHERE user_id = X)
ORDER BY id DESC LIMIT 20;

Just add a NOT IN to your WHERE clause.
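
Combined with the pagination from the question, a parameterized version might look like this (a sketch assuming psycopg2 and the blocked(user_id, author_id) table implied above):

def page_for_user(conn, topic_id, user_id, before_id=None, page=20):
    # The block list is applied in SQL, so LIMIT counts only visible rows.
    sql = ("SELECT * FROM responses"
           " WHERE topic_id = %(topic)s"
           " AND author_id NOT IN (SELECT author_id FROM blocked"
           " WHERE user_id = %(user)s)")
    params = {"topic": topic_id, "user": user_id, "limit": page}
    if before_id is not None:
        sql += " AND id < %(before)s"
        params["before"] = before_id
    sql += " ORDER BY id DESC LIMIT %(limit)s"
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()

Note that, like the query above, this covers only the blocked authors themselves; the "responses that reply to them" rule from the question would still need extra SQL or the application-side filter.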

If you have some reason you can't do this, then your chunking idea is best. You don't want to leave the query unbounded, because the database will then try to return the entire result set to whichever client or server is querying it.

answered Oct 10 '22 by Louis Ricci