 

Issues with wikipedia dump table pagelinks

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.

I unpacked the file; its uncompressed size is 37 GB.

The table structure is this:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

I imported the table into a new, empty database:

mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql

The computer I am running the task on has 16 GB of RAM and the MySQL database is located on an SSD, so I assumed that, despite the table's size, the import would not take too long.

However, the task has now been running for over a day and is still not finished. There are no other processes accessing MySQL and there is no other workload on the computer.

The database file itself is now 79 GB.

ls -lh

-rw-r----- 1 mysql mysql   65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql  79G May 13 16:59 pagelinks.ibd

The table now has over 500 million rows.

SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';

+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks  |  520919860 |
+------------+------------+
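
Note that table_rows in INFORMATION_SCHEMA is only an estimate for InnoDB tables. If an exact figure is needed once the import has finished, a plain count works, although it is slow on a table this size:

SELECT COUNT(*) FROM wp_dump.pagelinks;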

I am wondering:

Does enwiki-latest-pagelinks.sql really produce a table that is over 79 GB large?

Does pagelinks really contain over 500 million rows?

Does it really take that long to import the pagelinks table?

Can you provide some metrics on the expected table size and row count, please?

Update, May 14th, 2017:

Insert still running; the pagelinks.ibd file is now 130 GB; the number of rows is now almost 700 million.

Update, May 16th, 2017:

Insert still running; the pagelinks.ibd file is now 204 GB; the number of rows is now over 1.2 billion.

I calculated the rows inserted per second over the last two days:

rows/sec = 3236

Also: there are many thousands of rows per INSERT statement in the SQL script (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471).
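
A rough way to cross-check that rate from a second session, without disturbing the running import, is to sample the server's cumulative Innodb_rows_inserted counter twice and divide the difference by the elapsed seconds (a sketch; note the counter covers all InnoDB tables on the server, not just pagelinks):

-- sample this twice, e.g. 60 seconds apart, and divide the difference by 60
SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted';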

So, my follow-up / modified questions:

Are the number of rows and the .ibd file size to be expected, given the SQL file size of 37 GB and the table structure (as listed above)?

Is rows/sec = 3236 a good value (meaning that it takes a few days to insert the table)?

What may be the limiting speed factor / how can I speed up the import?

  • Disable the indexes (and calculate them after the insert)?
  • Optimize transactions (there is no explicit COMMIT in the script / autocommit is currently ON)? See the sketch after this list.
  • Optimize variable settings (e.g. innodb_buffer_pool_size, which is currently 134217728, i.e. the 128 MB default)?
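
For reference, a minimal sketch of the session-level tweaks from the MySQL bulk-loading tips, assuming the import is run from the mysql client; whether it helps here depends on the server configuration:

-- a sketch only: disable per-row checks and commit once at the end
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;
SOURCE enwiki-latest-pagelinks.sql;
COMMIT;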
asked May 13 '17 by Sim Betren


1 Answer

@Sim Betren: I am currently importing the same table and I get about 7,700 rows/s, which means about 600,000,000 rows a day. Probably the most important thing is to get the right InnoDB settings:

https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster

innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
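
For reference, these are server variables, so on a typical Ubuntu-style install they belong in the [mysqld] section of the MySQL configuration file and require a server restart. A quick way to confirm they actually took effect before starting the import:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';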

Those settings work well. From what I have read and tried, InnoDB loves high memory settings. Ideally, one would use a machine with 16 GB or even 32 GB of RAM and increase these settings even further. But I got 7,700 rows/s on a modest setup that is almost 10 years old:

  • Intel Q6700 quad
  • 8 Gb DDR2 memory

I combined that 10-year-old hardware with a 2017-model 500 GB SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS), and by using older hardware I saved a bit of money. However, the hardware is limited to 8 GB of DDR2. A newer dedicated machine with 32 GB or 64 GB of RAM could really fly, I reckon.

Software setup:

  • Linux Mint 64bit
  • MySQL Server 5.7.18 for Ubuntu
  • MySQL Workbench for importing

I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.

Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast too, also around 8,000 rows/s or more, but the import always became corrupted for some reason. So I would stick with InnoDB.
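
If you do experiment with other engines, a standard way to verify whether an imported table is intact (it works for both MyISAM and InnoDB; the wp_dump schema name is taken from the question) is:

CHECK TABLE wp_dump.pagelinks;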

Update 17-06-2017:

Finished the import. The table "pagelinks" is about 214 GB with 1.2 billion rows. About 112 GB is raw data and 102 GB is indexes. The original uncompressed file was about 37 GB.
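
For anyone who wants to reproduce that data/index breakdown, a query along these lines (using the wp_dump schema name from the question; the figures are InnoDB estimates) reports both sizes:

SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.TABLES
WHERE table_schema = 'wp_dump' AND table_name = 'pagelinks';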

It took about 2 days and 6 hours to import, at an average speed of 5,350 rows/second. With high-end equipment (lots of memory, preferably 64 GB or more) and optimal settings, it can probably be done faster. But I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.

Update 18-06-2017:

Also imported "page.sql" because this contains the names connected to the ID's. The uncompressed file is about 5Gb, import took 1 hour. Which seems quick: the pagelink file is about 37Gb which is 7x bigger than "page.sql". Yet takes 50x longer to import. So, there are a few reasons why "pagelinks" takes so long: (A) probably because it doesn't fit in memory (B) The table structure, many data per insert (C) Settings. But most likely it's memory.

Conclusion: try to get a PC with 32 GB or 64 GB of internal memory, maybe even more, and use SSDs that can keep up with that memory, 500 GB or more. The SSD is more important than the memory, so try that first.

answered by Dennis de Swart