Long UPDATE in postgresql

I have been running an UPDATE on a table containing 250 million rows and 3 indexes; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if there is a way to find out how close it is to being done, because if it is going to take a million days to do its thing I will kill it, but if it only needs another day or two I will let it run. Here is the query:

UPDATE pagelinks SET pl_to = page_id
    FROM page
    WHERE 
        (pl_namespace, pl_title) = (page_namespace, page_title)
        AND
        page_is_redirect = 0
;

The EXPLAIN is not the issue here, and I only mention the big table's multiple indexes to somewhat justify how long the UPDATE takes. But here is the EXPLAIN anyway:

Merge Join  (cost=127710692.21..135714045.43 rows=452882848 width=57)
  Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
  ->  Sort  (cost=3193335.39..3219544.38 rows=10483593 width=41)
        Sort Key: page.page_namespace, (page.page_title)::text
        ->  Seq Scan on page  (cost=0.00..439678.01 rows=10483593 width=41)
              Filter: (page_is_redirect = 0::numeric)
  ->  Sort  (cost=124517356.82..125285665.74 rows=307323566 width=46)
        Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
        ->  Seq Scan on pagelinks  (cost=0.00..6169460.66 rows=307323566 width=46)

Now I also sent a parallel query in order to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the backend process running the DROP INDEX?).

So I am wondering if there is a table that keeps track of the amount of dead tuples or something like that, because it would be nice to know how fast or how far along the UPDATE is in completing its task.

Thx (PostgreSQL is not as intelligent as I thought; it needs heuristics)

asked Jan 07 '09 by Nicholas Leonard



2 Answers

Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?

I'm not a regular PostgreSQL user, but I just read that doc, and then compared to the EXPLAIN output you're showing. Your UPDATE query seems to be using no indexes, and it's forced to do table-scans to sort both page and pagelinks. The sort is no doubt large enough to need temporary disk files, which I think are created under your temp_tablespace.

Then I see the estimated database pages read. The top-level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are in disk I/O accesses. So it's going to access the disk over 135 million times to do this UPDATE.

Note that even 10,000rpm disks with 5ms seek time can achieve at best 200 I/O operations per second under optimal conditions. This would mean that your UPDATE would take 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that period (i.e. continuous reads/writes with no breaks). This is impossible, and I'd expect the actual throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
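
In round numbers, that estimate works out as:

135,714,045 I/Os ÷ 200 I/Os per second ≈ 678,570 seconds ≈ 188 hours ≈ 7.8 days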

If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
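
For example (just a sketch, and the index name and the per-namespace batching are assumptions on my part, not something from the question): give the join an index to use, then update pagelinks one slice at a time so each statement stays small enough to avoid the giant on-disk sort:

CREATE INDEX page_ns_title_idx ON page (page_namespace, page_title);

-- repeat once per distinct pl_namespace value (0 is just an example)
UPDATE pagelinks SET pl_to = page_id
    FROM page
    WHERE
        pagelinks.pl_namespace = 0
        AND page.page_namespace = pagelinks.pl_namespace
        AND page.page_title = pagelinks.pl_title
        AND page.page_is_redirect = 0;

Whether the planner actually picks an index-based plan depends on the data distribution, so check EXPLAIN on one slice before running them all.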

As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
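
If you want to confirm that rather than guess, a query like this against pg_locks should show the ungranted lock request (treat it as a sketch; the exact catalog columns vary a little between versions):

SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE NOT granted;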

answered by Bill Karwin


This is very old, but if you want a way to monitor your update... Remember that sequences are affected globally (their values are visible outside the transaction), so you can just create one to monitor this update from another session by doing this:

create sequence yourprogress; 

UPDATE pagelinks SET pl_to = page_id
    FROM page
    WHERE 
        (pl_namespace, pl_title) = (page_namespace, page_title)
        AND
        page_is_redirect = 0 AND NEXTVAL('yourprogress')!=0;

Then in another session just do this (don't worry about transactions; as noted, sequences are affected globally):

select last_value from yourprogress;

This will show how many rows have been processed, so you can estimate how long the update will take.
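
If you want a rough percentage rather than a raw count, you can compare the counter against the planner's row estimate for the table (only approximate: reltuples is an estimate, and the sequence counts rows the condition was evaluated for, not necessarily rows updated):

SELECT s.last_value AS rows_processed,
       c.reltuples::bigint AS rows_total_estimate
FROM yourprogress s, pg_class c
WHERE c.relname = 'pagelinks';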

At the end, restart your sequence if you want to try again:

alter sequence yourprogress restart with 1;

Or just drop it:

drop sequence yourprogress;

answered by Luciano Andress Martini