Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delayed indexing in postgres

Tags:

postgresql

We have a system which stores data in a postgres database. In some cases, the size of the database has grown to several GBs.

When this system is upgraded, the data in the said database is backed up, and finally it's restored in the database. Owing to the huge amounts of data, the indexing takes a long time to complete (~30 minutes) during restoration, thereby delaying the upgrade process.

Is there a way where the data copy and indexing can be split into two steps, where the data is copied first to complete the upgrade, followed by indexing which can be done at a later time in the background?

Thanks!

like image 540
Maddy Avatar asked Oct 24 '17 04:10

Maddy


People also ask

Why is index scan slow?

The Index Lookup We already explained that (1) will always be fast. That means that if your index lookup is slow then that means that either your leaf node chain is very long, or your database needs to perform many table access operations.

What is indexing in PostgreSQL?

Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index. But indexes also add overhead to the database system as a whole, so they should be used sensibly.

What is gin index in PostgreSQL?

GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.

What is secondary index in PostgreSQL?

All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.

What is indexing in PostgreSQL?

PostgreSQL - INDEXES. Indexes are special lookup tables that the database search engine can use to speed up data retrieval. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book. For example, if you want to reference all pages in a book that discusses a certain topic, ...

What are the disadvantages of rebuilding an index in PostgreSQL?

Rebuilding an index can interfere with regular operation of a database. Normally PostgreSQL locks the table whose index is rebuilt against writes and performs the entire index build with a single scan of the table.

How do I drop an index in PostgreSQL?

Introduction to PostgreSQL DROP INDEX statement. 1 index_name. You specify the name of the index that you want to remove after the DROP INDEX clause. 2 IF EXISTS. 3 CASCADE. 4 RESTRICT. 5 CONCURRENTLY.

How to remove a non-existing index in PostgreSQL?

You specify the name of the index that you want to remove after the DROP INDEX clause. Attempting to remove a non-existent index will result in an error. To avoid this, you can use the IF EXISTS option. In case you remove a non-existent index with IF EXISTS, PostgreSQL issues a notice instead.


2 Answers

There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.

There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.

It would be quite nice to be able to restore everything except secondary indexes not depended on by FK constraints. Then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists, you'd have to write it yourself.

You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.

like image 86
Craig Ringer Avatar answered Oct 07 '22 07:10

Craig Ringer


There is an option to separate the data and creating index in postgresql while taking pg_dump. Here pre-data refers to Schema, post-data refers to index and triggers.

From the docs,

--section=sectionname Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more than once to select multiple sections. The default is to dump all sections.

The data section contains actual table data, large-object contents, and sequence values. Post-data items include definitions of indexes, triggers, rules, and constraints other than validated check constraints. Pre-data items include all other data definition items.

May be this would help :)

like image 32
krithikaGopalakrisnan Avatar answered Oct 07 '22 08:10

krithikaGopalakrisnan