
How to speed up a slow UPDATE query

Tags: sql, postgresql

I have the following UPDATE query:

UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;

Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.

The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
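One way to confirm that an index would not help (a quick check of my own, not something I've dug into deeply) is to look at the query plan; with ~490,000 of ~506,000 rows matching, the planner should choose a sequential scan regardless:

EXPLAIN
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
-- expected: Seq Scan on pages - an index on LastError would not be used
-- for a predicate that matches ~97% of the table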

The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
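For reference, the split I have in mind would look roughly like this (a hypothetical sketch; the new table and constraint names are made up, and I haven't actually run it, for the disk-space reasons above):

CREATE TABLE indexer.page_content
(
  id   uuid NOT NULL,
  html text NOT NULL,
  CONSTRAINT page_content_pkey PRIMARY KEY (id),
  CONSTRAINT page_content_id_fkey FOREIGN KEY (id)
    REFERENCES indexer.pages (id)
);

INSERT INTO indexer.page_content (id, html)
SELECT id, html FROM indexer.pages;

ALTER TABLE indexer.pages DROP COLUMN html;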

Ideas? I'm using Postgres 9.0.0.

UPDATE:

Here's the schema:

CREATE TABLE indexer.pages
(
  id uuid NOT NULL,
  url character varying(1024) NOT NULL,
  firstcrawled timestamp with time zone NOT NULL,
  lastcrawled timestamp with time zone NOT NULL,
  recipeid uuid,
  html text NOT NULL,
  lasterror character varying(1024),
  missingings smallint,
  CONSTRAINT pages_pkey PRIMARY KEY (id ),
  CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);

I also have two indexes:

CREATE INDEX idx_indexer_pages_missingings
  ON indexer.pages
  USING btree
  (missingings )
  WHERE missingings > 0;

and

CREATE INDEX idx_indexer_pages_null
  ON indexer.pages
  USING btree
  (recipeid )
  WHERE NULL::boolean;

There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.

asked Jun 18 '12 by Mike Christensen


1 Answer

What @kgrittn posted as a comment is the best answer so far. I am merely filling in details.

Before you do anything else, you should upgrade PostgreSQL to a current version, or at least to the latest security release of your major version. See the versioning guidelines of the project.

I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:

  • Redundant data in update statements
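To check whether UPDATEs on this table are currently HOT, you can compare total vs. HOT update counts in the statistics views (a sanity check of mine, not part of Kevin's comment):

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM   pg_stat_user_tables
WHERE  schemaname = 'indexer'
AND    relname = 'pages';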

If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreating them afterwards.
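Applied to the two partial indexes from the schema in the question, a sketch:

DROP INDEX indexer.idx_indexer_pages_missingings;
DROP INDEX indexer.idx_indexer_pages_null;

-- ... run the big UPDATE ...

CREATE INDEX idx_indexer_pages_missingings
  ON indexer.pages (missingings)
  WHERE missingings > 0;

CREATE INDEX idx_indexer_pages_null
  ON indexer.pages (recipeid)
  WHERE NULL::boolean;

Note that neither index involves lasterror, so in your particular case they do not disable HOT updates; dropping them still saves index maintenance for rows that cannot be updated HOT.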

In this context it would help to run multiple smaller UPDATEs (a batched sketch follows this list), if:

  • the updated column is not involved in any indexes (enabling HOT updates).
  • the UPDATE is easily divided into multiple patches in multiple transactions.
  • the rows in those patches are spread out over the table (physically, not logically).
  • there are no other concurrent transactions keeping dead tuples from being reused.
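A hedged sketch of one such patch (the LIMIT-subquery slicing and the batch size are my own illustration, not a prescription):

-- repeat until UPDATE reports 0 rows; each batch in its own transaction
BEGIN;
UPDATE indexer.pages
SET    lasterror = NULL
WHERE  id IN (
   SELECT id
   FROM   indexer.pages
   WHERE  lasterror IS NOT NULL
   LIMIT  50000
   );
COMMIT;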

Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.

The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.

Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - same as with dropping / recreating indexes or triggers.
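A sketch for the FK the question mentions (the referencing table and constraint names other_table / other_table_pageid_fkey are hypothetical, since they are not given in the question):

ALTER TABLE other_table DROP CONSTRAINT other_table_pageid_fkey;

-- ... UPDATE ...

ALTER TABLE other_table
  ADD CONSTRAINT other_table_pageid_fkey
  FOREIGN KEY (pageid) REFERENCES indexer.pages (id);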

Lastly, disable & re-enable any triggers that are not needed for the UPDATE.

Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.

So:

BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;

You cannot run VACUUM inside a transaction block. Per documentation:

VACUUM cannot be executed inside a transaction block.

You could split your operation into a few big chunks and run in between:

VACUUM ANALYZE tbl;

If you don't have to deal with concurrent transactions you could (even more effectively):

ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)

-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl;  -- optionally in between, or autovacuum kicks in

-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
answered Sep 28 '22 by Erwin Brandstetter