I have a very large database table in PostgreSQL and a column like "replicated". Every new row starts out un-replicated and will later be replicated to another system by a background program. There is a partial index on that table: "btree(ID) WHERE replicated=0". The background program selects at most 2000 entries (LIMIT 2000), works on them, and then commits the changes in one transaction using 2000 prepared SQL commands.
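For concreteness, here is a minimal sketch of the setup being described; the table name, the column types, and the payload column are assumptions, not taken from the real schema:

-- Hypothetical schema matching the description above
CREATE TABLE records (
    id         BIGINT PRIMARY KEY,
    payload    TEXT,
    replicated SMALLINT NOT NULL DEFAULT 0  -- 0 = not yet replicated
);

-- Partial index: only rows still waiting for replication are indexed,
-- so the background program can find its next batch cheaply.
CREATE INDEX records_unreplicated_idx ON records (id) WHERE replicated = 0;

-- The background program's batch query:
SELECT id, payload FROM records WHERE replicated = 0 ORDER BY id LIMIT 2000;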
Now the problem is that I want to give the user an option to reset this replicated value, i.e. make it all zero again.
An
update table set replicated=0;
is not possible: it runs as one giant transaction. I actually don't need transaction features for this case: if the system goes down, it is fine if only part of the rows have been processed.
There are several other problems. Doing an
update table set replicated=0 where id >10000 and id <20000
is also bad: it does a sequential scan over the whole table, which is too slow. Even if it didn't, it would still be slow because of too many seeks.
What I really need is a way of going through all rows, changing them and not being bound to a giant transaction.
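One workaround (a sketch, not from the original post; the table name, the batch size, and the assumption of a plain, non-partial index on id are mine) is to drive many small transactions from a client-side script, so that no single transaction touches more than a bounded number of rows:

-- Each batch is its own transaction: if the box goes down mid-way,
-- only the remaining batches have to be redone.
BEGIN;
UPDATE records SET replicated = 0
 WHERE id >= 0 AND id < 10000        -- batch 1: ids [0, 10000)
   AND replicated <> 0;              -- skip rows that are already 0
COMMIT;

BEGIN;
UPDATE records SET replicated = 0
 WHERE id >= 10000 AND id < 20000    -- batch 2: ids [10000, 20000)
   AND replicated <> 0;
COMMIT;

-- ...and so on, with the id ranges generated by a small driver loop.

Without an ordinary btree index on id (the partial index above only covers replicated = 0), each of these range updates falls back to exactly the sequential scan complained about above.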
Strangely, an
UPDATE table
SET replicated=0
WHERE ID in (SELECT id FROM table WHERE replicated=1 LIMIT 10000)
is also slow, although it should be a good thing: go through the table in disk order...
(Note that in that case there was also an index that covered this)
(An UPDATE ... LIMIT like MySQL's is not available in PostgreSQL.)
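For reference, on much newer PostgreSQL versions a common way to emulate an UPDATE ... LIMIT is to batch on the physical row address, ctid (a sketch; the table name is assumed, and this is not practical on 7.4, where a ctid join like this does not plan efficiently):

-- Updates at most 10000 rows per statement, roughly in on-disk order.
UPDATE records
   SET replicated = 0
 WHERE ctid IN (SELECT ctid FROM records
                 WHERE replicated = 1
                 LIMIT 10000);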
BTW: the real problem is more complicated, and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible. It's PostgreSQL 7.4, unfortunately.
The number of rows I'm talking about is e.g. 90,000,000. The size of the database can be several dozen gigabytes.
The database itself only contains 5 tables, one of which is very large. But that is not bad design: these embedded boxes only operate on one kind of entity; it's not an ERP system or anything like that!
Any ideas?
PostgreSQL normally stores its table data in chunks of 8KB. The number of these blocks is limited to a 32-bit signed integer (just over two billion), giving a maximum table size of 16TB.
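Working through the arithmetic behind that limit: 2^31 blocks × 8 KB per block = 2^31 × 2^13 bytes = 2^44 bytes = 16 TB.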
If you're simply filtering the data and the data fits in memory, Postgres can scan roughly 5-10 million rows per second (assuming a reasonable row size of, say, 100 bytes). If you're aggregating, you're at about 1-2 million rows per second.
HOT updates are the one feature that can enable PostgreSQL to handle workloads with many UPDATEs. In UPDATE-heavy workloads it can be a life saver to avoid indexing the updated columns and to set a fillfactor of less than 100. (Note, though, that HOT only arrived in PostgreSQL 8.3, so it is not available on the 7.4 installation described above.)
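A sketch of what that looks like on a version that supports it (the table name is assumed; fillfactor as a storage parameter needs 8.2+, HOT itself 8.3+):

-- Leave 30% of each page free so that updated row versions can stay
-- on the same page, which is a prerequisite for HOT updates.
ALTER TABLE records SET (fillfactor = 70);

-- Rewrite the existing pages so the new fillfactor takes effect.
VACUUM FULL records;

-- Caveat: HOT only applies when no indexed column changes, and columns
-- in a partial index's WHERE clause count, so the partial index on
-- "replicated = 0" above would itself defeat HOT for this update.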
How about adding a new table to store this replicated value (and a primary key to link each record to the main table)? Then you simply add a record for every replicated item, and delete records to clear the replicated flag. (Or maybe the other way around: a record for every non-replicated row, depending on which is the common case.)
That would also simplify the case where you want to set them all back to 0, as you could just truncate the table (which zeroes the table's size on disk; you don't even have to vacuum to free up the space).
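A sketch of that design, with assumed table and column names:

-- One row here per already-replicated record in the main table.
CREATE TABLE replicated_ids (
    id BIGINT PRIMARY KEY REFERENCES records (id)
);

-- Marking a record as replicated:
INSERT INTO replicated_ids (id) VALUES (12345);

-- The background program's batch query becomes an anti-join:
SELECT r.id
  FROM records r
 WHERE NOT EXISTS (SELECT 1 FROM replicated_ids f WHERE f.id = r.id)
 LIMIT 2000;

-- Resetting everything to "not replicated" is now near-instant:
TRUNCATE TABLE replicated_ids;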