
PostgreSQL: Can I run an UPDATE query in parallel?

I have a big table with 10M rows, and I need to get some statistic value for each row. I have a function that generates this value, for example GetStatistic(uuid). This function is very slow and its result changes rarely, so I've created a Statistic column in my table and once a day execute a query like this:

UPDATE MyTable SET Statistic = GetStatistic(ID);

In SELECT queries I then use the Statistic column without calling the GetStatistic function.

The problem is that my production server has 64 CPUs and a lot of memory, so nearly the whole database can be cached in RAM, but this query uses only one CPU and needs 2 or 3 hours to execute.

The GetStatistic function uses tables that are constant during the whole execution of the UPDATE query. Can I modify the query to get PostgreSQL to calculate GetStatistic in parallel for different rows simultaneously, using all available CPUs?

asked Oct 17 '12 by Yavanosta


People also ask

How many updates per second can Postgres handle?

When using Postgres, if you need writes exceeding tens of thousands of INSERTs per second, turn to the Postgres COPY utility for bulk loading. COPY is capable of handling hundreds of thousands of writes per second. Even without a sustained high write throughput, COPY can be handy for quickly ingesting a very large set of data.
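For reference, a minimal COPY sketch (the table, column list and file path are made up for illustration):

-- Bulk-load rows from a CSV file; COPY reads a file on the server,
-- while psql's \copy reads a file on the client instead.
COPY mytable (id, payload)
FROM '/tmp/mytable.csv'
WITH (FORMAT csv, HEADER true);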

What is parallel query in PostgreSQL?

Parallel queries in PostgreSQL have the ability to use more than one CPU core per query. In parallel queries the optimizer breaks down the query tasks into smaller parts and spreads each task across multiple CPU cores.
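As a hedged illustration (the filter is invented; MyTable and Statistic are the names from the question above), an eligible read-only query shows a Gather node and planned workers in its EXPLAIN output:

-- Allow up to 4 workers per Gather node in this session (the default is usually 2).
SET max_parallel_workers_per_gather = 4;

-- For a parallel-eligible read query, EXPLAIN shows a Gather node
-- with "Workers Planned", and each worker scans part of the table.
EXPLAIN
SELECT count(*) FROM MyTable WHERE Statistic > 0;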

How many concurrent queries can Postgres handle?

At provision, Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications.

Is Postgres multithreaded?

The PostgreSQL server is process-based (not threaded). Each database session connects to a single PostgreSQL operating system (OS) process. Multiple sessions are automatically spread across all available CPUs by the OS. The OS also uses CPUs to handle disk I/O and run other non-database tasks.


1 Answer

PostgreSQL versions older than 10 execute each query in a single backend, which is a process with a single thread. It cannot use more than one CPU for a query. It's also somewhat limited in what I/O concurrency it can achieve within a single query, really only doing concurrent I/O for bitmap index scans and otherwise relying on the OS and disk system for concurrent I/O.

PostgreSQL 10+ supports parallel query. At the time of writing (the PostgreSQL 12 release), parallel query is only used for read-only queries. Parallel query support enables considerably more parallelism for some types of query.
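A quick way to confirm this on your own server (using the table and function names from the question) is to EXPLAIN the UPDATE: the plan shows a plain Update over a Seq Scan with no Gather node, i.e. no parallel workers:

-- As of PostgreSQL 12, data-modifying statements do not get parallel plans.
EXPLAIN
UPDATE MyTable SET Statistic = GetStatistic(ID);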

Pg is good at concurrent loads of many smaller queries and it's easy to saturate your system that way. It just isn't as good at making the best of system resources for one or two really big queries, though this is improving as parallel query support is added for more types of query.


If you're on an older PostgreSQL without parallel query, or your query doesn't benefit from parallel query support yet:

What you can do is split the job up into chunks and hand them out to workers. You've alluded to this with:

Can I modify the query to get PostgreSQL to calculate GetStatistic in parallel for different rows simultaneously, using all available CPUs?

There are a variety of tools, like DBlink, PL/Proxy, pgbouncer and PgPool-II, that are designed to help with this kind of job. Alternatively, you can just do it yourself, starting (say) 8 workers that each connect to the database and run UPDATE ... WHERE id BETWEEN ? AND ? statements with non-overlapping ID ranges. A more sophisticated option is to have a queue controller hand out ranges of, say, 1000 IDs to workers that UPDATE each range and then ask for a new one.
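A do-it-yourself split might look like the sketch below, with each statement run in its own database session. The 8-way split over roughly 10M IDs is illustrative only and assumes a dense numeric ID; with a uuid key you would partition on something else.

-- Worker / session 1:
UPDATE MyTable SET Statistic = GetStatistic(ID) WHERE ID BETWEEN 1 AND 1250000;

-- Worker / session 2:
UPDATE MyTable SET Statistic = GetStatistic(ID) WHERE ID BETWEEN 1250001 AND 2500000;

-- ... and so on, up to worker 8 covering 8750001 to 10000000.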

Note that 64 CPUs doesn't mean that 64 concurrent workers is ideal; your disk I/O is a factor too when it comes to writes. You can reduce your I/O costs a bit by setting a commit_delay for your UPDATE transactions and, if it's safe for your business requirements for this data, synchronous_commit = 'off'; the load from syncs should then be reduced significantly. Nonetheless, it's likely that the best throughput will be achieved well below 64 concurrent workers.
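A sketch of those settings, run in each worker session before its batch of UPDATEs (the values are illustrative, and changing commit_delay may require superuser privileges):

SET commit_delay = 10000;        -- microseconds; lets concurrent commits share one WAL flush
SET synchronous_commit = off;    -- don't wait for the WAL flush at commit; the most recent
                                 -- commits can be lost on a crash, but nothing is corrupted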

It's highly likely that your GetStatistic function can be made a lot faster by converting it to an inlineable SQL function or view, rather than the presumably loop-heavy procedural PL/pgSQL function it is at the moment. It might help if you showed this function.
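For illustration only, since the real function body isn't shown: if GetStatistic is essentially counting or aggregating related rows, a view (or a single-statement SQL function) exposes that logic as plain SQL the planner can fold into the query, and the daily refresh becomes one set-based statement. The events table and item_id column below are invented:

-- Hypothetical definition: the statistic is assumed to be a count of related rows.
CREATE VIEW item_statistics AS
SELECT e.item_id, count(*) AS statistic
FROM events e
GROUP BY e.item_id;

-- One set-based UPDATE instead of 10M per-row function calls.
UPDATE MyTable m
SET Statistic = v.statistic
FROM item_statistics v
WHERE v.item_id = m.ID;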

answered Nov 16 '22 by Craig Ringer