I downloaded the German Wikipedia dump dewiki-20151102-pages-articles-multistream.xml. My short question is: what does 'multistream' mean in this case?
Multistream means the dump consists of many independently compressed bz2 streams, together with an index that maps articles to stream offsets. A reader can use the index to decompress only the section it needs, instead of the entire file, and so pull individual articles out of a compressed dump.
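As a sketch of how that works in practice, here is a Python example that looks up an article in the companion index file (each line reads offset:page_id:title) and decompresses only the stream containing it. The file names and the 'Berlin' lookup are illustrative, not part of the original answer:

```python
import bz2

# Illustrative file names; the index ships alongside the multistream dump.
DUMP = "dewiki-20151102-pages-articles-multistream.xml.bz2"
INDEX = "dewiki-20151102-pages-articles-multistream-index.txt.bz2"

def find_offset(title):
    """Scan the index for a title; each line reads 'offset:page_id:title'."""
    with bz2.open(INDEX, mode="rt", encoding="utf-8") as idx:
        for line in idx:
            offset, _page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None

def read_stream(offset):
    """Decompress the single bz2 stream starting at byte `offset`.
    It holds a batch of <page> elements (typically 100), one of which
    is the article we looked up."""
    decompressor = bz2.BZ2Decompressor()
    chunks = []
    with open(DUMP, "rb") as dump:
        dump.seek(offset)
        while not decompressor.eof:
            block = dump.read(64 * 1024)
            if not block:
                break
            chunks.append(decompressor.decompress(block))
    return b"".join(chunks).decode("utf-8")

offset = find_offset("Berlin")
if offset is not None:
    print(read_stream(offset)[:500])  # first part of the XML fragment
```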
We can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org (a dump is a periodic snapshot of a database). The English version is at dumps.wikimedia.org/enwiki.
The dumps are compressed with bzip2. bzip2 has parallel implementations that compress and decompress files faster by splitting the data into multiple independent streams, and data compressed this way is tagged as multistream.
Knowing this matters when you process the dump from a programming language, since you may have to tell the library how the file was compressed: some bz2 readers stop after the first stream of a multistream file.
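For example, in Python (one possible choice of language here), the standard-library bz2 module reads concatenated streams transparently since Python 3.3, whereas Python 2's BZ2File stopped after the first stream. A minimal sketch:

```python
import bz2

# Python 3.3+ bz2 transparently reads files made of multiple
# concatenated streams, so the whole multistream dump can be
# iterated as if it were a single compressed file.
with bz2.open("dewiki-20151102-pages-articles-multistream.xml.bz2",
              mode="rt", encoding="utf-8") as dump:
    for lineno, line in enumerate(dump):
        print(line.rstrip())
        if lineno >= 20:  # peek at the XML header only
            break
```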