When running a SELECT query against a single table l (no joins) with billions of rows, is it a good idea to run concurrent queries by splitting the query into multiple queries over distinct ranges of an indexed column, say the integer primary key id?
Or does Postgres internally do this already, leading to no significant gain in speed for the end user?
I have two use cases:
getting the total count of rows
getting the list of ids
Edit 1: The query has a WHERE clause on several columns; one of them is not indexed, the others are indexed:
SELECT id
FROM l
WHERE indexed_column_1 = 'A'
  AND indexed_column_2 = 'B'
  AND not_indexed_column_1 = 'C'
Postgres has had parallel query built in since version 9.6, and it has been improved in every version since. It will be much more efficient than manually splitting a SELECT on a big table.
You can tune max_parallel_workers (and max_parallel_workers_per_gather) to match your hardware and workload.
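A minimal sketch of checking and adjusting these settings and verifying that the planner actually picks a parallel plan, using the (hypothetical) column names from the question; the worker count is illustrative and depends on your hardware:

-- Check the current parallel query settings
SHOW max_parallel_workers;
SHOW max_parallel_workers_per_gather;

-- Allow more workers for this session (value is illustrative)
SET max_parallel_workers_per_gather = 8;

-- Verify the plan: look for a Gather node with
-- Parallel Seq Scan / Parallel Index Scan nodes below it
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM l
WHERE indexed_column_1 = 'A'
  AND indexed_column_2 = 'B'
  AND not_indexed_column_1 = 'C';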
Since you are only interested in the id column, it may help to have an index that contains id (which you already have if it is the primary key) and to fulfill the prerequisites for an index-only scan, most importantly a well-vacuumed table with an up-to-date visibility map.
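For the filtered query from the edit, an index-only scan additionally needs the filter columns in the index. A hedged sketch of such a covering index, using the hypothetical column names from the question (the index name is made up; INCLUDE requires PostgreSQL 11 or later):

-- Composite index on the filter columns, with id included so the query
-- can be answered from the index alone (index-only scan), provided the
-- visibility map is reasonably up to date (run VACUUM on the table).
CREATE INDEX CONCURRENTLY l_filter_covering_idx
    ON l (indexed_column_1, indexed_column_2, not_indexed_column_1)
    INCLUDE (id);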
In the case where you want to count the number of rows, you can just let PostgreSQL's internal query parallelization do the work. It will be faster, and the result will be consistent, because it is computed under a single snapshot (counts gathered from separate sessions could see different data).
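For illustration, a count under the same hypothetical filter; on a large table the plan typically shows Partial Aggregate nodes in the workers and a Finalize Aggregate above the Gather, i.e. the counting itself runs in parallel:

EXPLAIN (ANALYZE)
SELECT count(*)
FROM l
WHERE indexed_column_1 = 'A'
  AND indexed_column_2 = 'B'
  AND not_indexed_column_1 = 'C';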
In the case where you want to get the list of primary keys, it depends on the WHERE conditions of the query. If you are selecting only a few rows, parallel query will do nicely.
If you want all ids of the table, PostgreSQL will probably not choose a parallel plan, because the cost of passing so many values from the worker processes to the leader will outweigh the advantages of parallelization. In that case, you may indeed be faster with parallel sessions, as you envision.
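A sketch of that manual split, assuming a reasonably dense integer primary key id and the hypothetical filter from the question; each statement runs in its own session/connection, and the range boundaries are illustrative:

-- Session 1
SELECT id
FROM l
WHERE indexed_column_1 = 'A'
  AND indexed_column_2 = 'B'
  AND not_indexed_column_1 = 'C'
  AND id >= 1 AND id < 500000000;

-- Session 2
SELECT id
FROM l
WHERE indexed_column_1 = 'A'
  AND indexed_column_2 = 'B'
  AND not_indexed_column_1 = 'C'
  AND id >= 500000000 AND id < 1000000000;

-- ... and so on, until the ranges cover min(id) .. max(id) without overlap.

Keep in mind that separate sessions use separate snapshots; if the combined result must be consistent, you can share one snapshot across the sessions with pg_export_snapshot() and SET TRANSACTION SNAPSHOT.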