 

How to read bz2 (bzip2) compressed Wikipedia dumps into a stream XML record reader for Hadoop MapReduce

I am using Hadoop MapReduce to do research on the Wikipedia data dumps (compressed in bz2 format). Since these dumps are so big (5 TB), I can't decompress the XML data into HDFS and just use the StreamXmlRecordReader that Hadoop provides. Hadoop does support decompressing bz2 files, but it splits the pages arbitrarily and sends those splits to the mapper. Because this is XML, we need the splits to occur at tag boundaries. Is there any way to use Hadoop's built-in bz2 decompression and its stream XML record reader together?

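For context, here is a rough sketch of how StreamXmlRecordReader is typically wired up through the Hadoop Streaming interface on uncompressed input; the paths, mapper script, and the page-tag delimiters are just placeholders for my setup, and the streaming jar location varies by Hadoop version:

    # Works on plain XML: records are cut at <page>...</page> boundaries.
    # Once the input is a .bz2 file, the splits no longer line up with <page> elements.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -inputreader "StreamXmlRecord,begin=<page>,end=</page>" \
        -input /wikipedia/enwiki-pages-articles.xml \
        -output /wikipedia/out \
        -mapper ./my_mapper.py \
        -reducer ./my_reducer.py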
asked Jul 17 '11 by Laurel Orr


1 Answer

The Wikimedia Foundation just released an InputReader for the Hadoop Streaming interface that is able to read the bz2-compressed full dump files and send them to your mappers. The unit being sent to a mapper is not a whole page but two revisions (so you can actually run a diff on the two revisions). This is the initial release and I am sure there will be some bugs, but please give it a spin and help us test it.

This InputReader requires Hadoop 0.21, as Hadoop 0.21 has streaming support for bz2 files. The source code is available at: https://github.com/whym/wikihadoop

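As a rough sketch, launching a streaming job over the .bz2 dump with this InputReader might look like the following; I'm assuming the input format class is the StreamWikiDumpInputFormat shipped in the repo, so check the README for the exact class name, jar name, and flags:

    # The input format handles bz2 decompression and hands each mapper a pair of
    # adjacent revisions, so the mapper can compute a diff between them.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -libjars wikihadoop.jar \
        -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
        -input /wikipedia/enwiki-pages-meta-history.xml.bz2 \
        -output /wikipedia/diffs \
        -mapper ./diff_mapper.py

The mapper is just whatever script consumes a revision pair on stdin; using /bin/cat as the first mapper is a handy way to see what the records look like before writing the real diff logic.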
answered Sep 28 '22 by DrDee