Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Re-indexing huge database (the English Wikipedia) efficiently

THE GIST

Before performing a massive 40+ GB import of the English Wikipedia, I had to temporarily remove indexes and auto-increment fields from three tables ('page', 'revision', and 'text') to handle the load. Now I have finally successfully imported the English Wikipedia to my local machine and created a local mirror (MediaWiki API). Yay!

However, I now need to re-create the indexes and auto-increment fields in less than a decade. Luckily, (1) I took plenty of screen-shots of the relevant tables in phpmyadmin before I removed the indexes and fields; (2) I can explain with extreme precision the steps I took before the import; and (3) this shouldn't be too difficult for anyone fluent in MySQL. Unfortunately, I have no expertise in MySQL whatsoever, so "baby steps" explanations would be extremely helpful.

PRECISELY WHAT I DID (PREPARING FOR THE IMPORT):

Steps 1, 2, 3: This image depicts the table page before I modified the field page_id by clicking 'Change' and un-checking 'Auto-Increment' (in preparation for importing). I performed precisely the same modifications for the field rev_id in table revision and old_id in table text but omitted the screen-shots to avoid redundancy.

table 'page' before modification of 'page_id'

Step 4: This image depicts the indexes for the table page before I dropped all of them.

indexes for table 'page' before I dropped them

Step 5: This image depicts the indexes for the table revision before I dropped all of them.

indexes for table 'revision' before I dropped them

Step 6: This image depicts the indexes for the table text before I dropped all of them.

indexes for table 'text' before I dropped them

WHAT I NEED NOW (RESTORING AFTER THE IMPORT):

I just need to restore the original indexes and auto-increment fields without waiting a hundred years.

Set-up details: PHP 5.3.8 (apache2handler), MySQL 5.5.16 (InnoDB), Apache 2.2.21, Ubuntu 12.04 LTS, MediaWiki 1.19.0 (private wiki)

like image 740
Brian Schmitz Avatar asked Jun 06 '12 21:06

Brian Schmitz


1 Answers

I really like Wikipedia so I'll try to help.

You need to use a lot of

ALTER TABLE

Add primary keys

ALTER TABLE page ADD PRIMARY KEY (page_id);
ALTER TABLE revision ADD PRIMARY KEY (rev_id);
ALTER TABLE text ADD PRIMARY KEY (old_id);

Add auto increments back

ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

I need the table descriptions for all tables before continuing. If rev_id and old_id are same definitions as page_id then:

ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

Add unique keys

ALTER TABLE page ADD UNIQUE name_title(page_namespace, page_title);
ALTER TABLE revision ADD UNIQUE rev_page_id(rev_page, rev_id);

Other indexes

ALTER TABLE page ADD INDEX page_random(page_random);
ALTER TABLE page ADD INDEX page_len(page_len);
ALTER TABLE page ADD INDEX page_redirect_namespace(page_is_redirect, page_namespace, page_len);
ALTER TABLE revision ADD INDEX rev_timestamp(rev_timestamp);
ALTER TABLE revision ADD INDEX page_timestamp(rev_page, rev_timestamp);
ALTER TABLE revision ADD INDEX user_timestamp(rev_user, rev_timestamp);
ALTER TABLE revision ADD INDEX user_text_timestamp(rev_user_text, rev_timestamp);

Again, there may be column definitions that change this stuff. You need to provide the CREATE TABLE info.

like image 100
saccharine Avatar answered Oct 19 '22 02:10

saccharine