 

PostgreSQL - How to speed up updating a huge table (100 million rows)?

I have two huge tables:

Table "public.tx_input1_new" (100,000,000 rows) 

     Column     |            Type             | Modifiers
----------------|-----------------------------|----------
 blk_hash       | character varying(500)      |
 blk_time       | timestamp without time zone |
 tx_hash        | character varying(500)      |
 input_tx_hash  | character varying(100)      |
 input_tx_index | smallint                    |
 input_addr     | character varying(500)      |
 input_val      | numeric                     |

Indexes:
    "tx_input1_new_h" btree (input_tx_hash, input_tx_index) 

Table "public.tx_output1_new" (100,000,000 rows)

    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 tx_hash      | character varying(100) |
 output_addr  | character varying(500) |
 output_index | smallint               |
 output_val   | numeric                |

Indexes:
    "tx_output1_new_h" btree (tx_hash, output_index)

I want to update the first table using the other one:

UPDATE tx_input1 as i
SET 
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1 as o
WHERE 
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

Before executing this SQL command, I had already created indexes on these two tables:

CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);

CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);

I used the EXPLAIN command to look at the query plan, but it doesn't use the indexes I created.

It took about 14-15 hours to complete this UPDATE.

What is the problem here?

How can I shorten the execution time, or tune my database/table?

Thank you.

asked Mar 02 '17 by user3383856


1 Answer

Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.

First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
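
You can see this for yourself by running EXPLAIN on the statement (written here against the *_new table names from the definitions above). With tables of this size the plan should be a hash join fed by two sequential scans, with no index scan anywhere; the comment below describes the expected shape, it is not output captured from this system:

EXPLAIN
UPDATE tx_input1_new AS i
SET
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1_new AS o
WHERE
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;
-- With tables this size, expect a Hash Join fed by two Seq Scans
-- (one of them under a Hash node), not an index scan.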

You can expect such an operation to take a long time, but there are some ways to speed it up (a combined sketch follows this list):

  • Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.

  • Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.

  • Increase checkpoint_segments (or max_wal_size from version 9.5 on) to a high value so that there are fewer checkpoints during the UPDATE operation.

  • Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
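
Putting those suggestions together, here is a minimal sketch of the whole sequence, assuming the *_new table and index names from the question; the memory and WAL sizes are placeholders to adjust to your hardware, not recommendations:

-- Raise the WAL limit (9.5 and later); on older versions raise checkpoint_segments instead.
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();

-- Give this session as much hash memory as the machine can spare.
SET work_mem = '1GB';

-- Refresh statistics so the planner sizes the hash table sensibly.
ANALYZE tx_input1_new;
ANALYZE tx_output1_new;

-- Drop indexes (and any constraints) on the table being updated.
DROP INDEX tx_input1_new_h;

UPDATE tx_input1_new AS i
SET
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1_new AS o
WHERE
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

-- Recreate the index afterwards.
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);

RESET work_mem;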

After the UPDATE, if it affects a large number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and as a consequence speed up sequential scans.
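
For example (table name taken from the definitions above; VACUUM (FULL) rewrites the table under an exclusive lock, so nothing else can access it while it runs):

-- Rewrites the table compactly and refreshes its statistics.
VACUUM (FULL, ANALYZE) tx_input1_new;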

answered Sep 28 '22 by Laurenz Albe