Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is adding and dropping indexes everyday on huge tables a good practice?

I'm building a Web Application that is connected to a MySQL database. I've got two huge tables containing each about 40 millions rows at the moment, and they are receiving new rows everyday (which adds ~ 500 000-1000 000 rows everyday).

The process to add new rows runs during the night, while no one can use the application, and the new rows' content depends on the result of some basic SELECT queries on the current database. In order to get the result of those SELECT statement fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.

The thing is, during the day, some totally different queries are run against those tables, including some "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)). I found on stack this very helpful mini-cookbook that advises you on which INDEXes you should use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql They advice to use compound index: in my example query above it would give INDEX(a, b, date).

It indeed increased the speed of the queries run during the day (from 1 minute to 8 seconds so I was truly happy).

However, with those compound indexes, the required time to add new rows during the night totally explode (it would take more than one day to add the daily content).

Here is my question: would that be ok to drop all the indexes every night, add the new content, and set back up the daily indexes? Or would that be dangerous since indexes are not meant to be rebuilt every day, especially on such big tables? I know such an operation would take approximately two hours in total (drop and recreate INDEXes).

I am aware of the existence of ALTER TABLE table_name DISABLE KEYS; but I'm using InnoDB and I believe it is not made to work on InnoDB table.

like image 535
Tchopane Avatar asked Nov 23 '16 13:11

Tchopane


2 Answers

I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.

I would ask about how you are inserting new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no index) and do a bulk insert:

insert into bigtable( . . .)
    select . . .
    from smalltable;

These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.

like image 193
Gordon Linoff Avatar answered Oct 22 '22 07:10

Gordon Linoff


A digression... PARTITIONing by date should be very useful for you since you are deleting things over a year ago. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking it into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.

More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.

But back to your clever idea of dropping and rebuilding indexes. To others, I point out the caveat that you have the luxury of not otherwise touching the table for long enough to do the rebuild.

With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? This partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM, thereby be 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and RAM size, I can help you do the arithmetic to see if your 'last' partition will fit in RAM.

A note about index updates in InnoDB -- they are 'delayed' by sitting in the "Change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_size_max, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)

The default for that setting is 25, meaning that 25% of the buffer_pool is set aside for pending updates to indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, hence finish faster.

Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:

Just before the nightly INSERTs:

innodb_change_buffer_size_max = 70
innodb_old_blocks_pct = 10

Soon after the nightly INSERTs:

innodb_change_buffer_size_max = 25
innodb_old_blocks_pct = 37

(I am not sure about that other setting, but it seems reasonable to push it out of the way.)

Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.

In a similar application, I had big, hourly, dumps to load into a table, and a 90-day retention. I stretched my Partition rules by having 90 daily partitions and 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions is a performance killer until 8.0; so don't even consider daily partitions for you 1-year retention.)

The Summary tables made so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That will let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question ! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: Don't have any indexes (other than the PRIMARY KEY) on the Fact table; use Summary tables for things that need messier indexes.

like image 30
Rick James Avatar answered Oct 22 '22 06:10

Rick James