
Multistream Wikipedia dump

I downloaded the German Wikipedia dump dewiki-20151102-pages-articles-multistream.xml. My short question is: what does 'multistream' mean in this case?

m4ri0 asked Nov 11 '15 00:11


People also ask

What is Wikipedia multistream?

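A multistream dump is a series of separately compressed bz2 streams concatenated into one file. A companion index records the byte offset at which each stream starts, so a reader can seek to that offset and decompress only that stream. This lets a reader pull individual articles out of a compressed dump without decompressing the entire thing.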
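As a rough illustration of index-based access, here is a minimal Python sketch. It builds a tiny multistream blob in memory instead of using a real dump, so the page contents and the helper names `build_multistream` and `read_stream_at` are hypothetical; only the standard `bz2` module is used.

```python
import bz2
import io


def build_multistream(chunks):
    """Compress each chunk as its own bz2 stream and concatenate them.

    Returns the combined bytes and the byte offset where each stream starts
    (the offsets play the role of the index).
    """
    blob = b""
    offsets = []
    for chunk in chunks:
        offsets.append(len(blob))
        blob += bz2.compress(chunk)
    return blob, offsets


def read_stream_at(fileobj, offset, block_size=64 * 1024):
    """Decompress exactly one bz2 stream that starts at `offset`."""
    fileobj.seek(offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    # eof becomes True once the end-of-stream marker has been reached.
    while not decompressor.eof:
        block = fileobj.read(block_size)
        if not block:
            break  # input ended before the stream was complete
        parts.append(decompressor.decompress(block))
    return b"".join(parts)


# Stand-in data: three made-up "pages", one per stream.
chunks = [b"<page>Alpha</page>", b"<page>Beta</page>", b"<page>Gamma</page>"]
blob, offsets = build_multistream(chunks)

# Jump straight to the second stream without touching the first.
second = read_stream_at(io.BytesIO(blob), offsets[1])
print(second)
```

With a real dump, the offsets would come from the index file that accompanies the multistream archive, and `fileobj` would be the dump opened in binary mode.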

How do you read a Wikipedia dump?

We can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org. (A dump is a periodic snapshot of a database.) The English version is at dumps.wikimedia.org/enwiki.


1 Answer

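The dumps are compressed with bz2. A bz2 file may consist of several independently compressed streams concatenated together, which is also the layout that parallel bz2 tools such as pbzip2 produce so that they can compress and decompress files faster. Data compressed this way is tagged as multistream.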

Knowing this makes a difference when you process the dump from a programming language, since with some libraries you have to pass a flag that tells the library how to decompress the file (as multiple streams or as a single stream).
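To show why the decoding mode matters, here is a minimal Python sketch using only the standard `bz2` module. The two short byte strings are made-up stand-ins for dump content. A `BZ2Decompressor` object stops at the end of the first stream, while `bz2.decompress` (Python 3.3 and later) reads all concatenated streams.

```python
import bz2

# Two independently compressed streams, concatenated into one blob.
first = bz2.compress(b"stream one\n")
second = bz2.compress(b"stream two\n")
multistream = first + second

# A single BZ2Decompressor handles exactly one stream.
single = bz2.BZ2Decompressor()
head = single.decompress(multistream)
print(head)                           # data from the first stream only
print(single.eof)                     # True: end of the first stream reached
print(single.unused_data == second)   # the second stream was left undecoded

# bz2.decompress reads every concatenated stream.
everything = bz2.decompress(multistream)
print(everything)
```

A reader that treats the dump as a single stream would silently stop after the first stream, so it is worth checking how the library you use handles concatenated input.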

David Przybilla answered Oct 21 '22 16:10