 

How to read bz2 (bzip2) compressed Wikipedia dumps into a stream XML record reader for Hadoop MapReduce

I am using Hadoop MapReduce to do research on the Wikipedia data dumps (compressed in bz2 format). Since these dumps are so big (5 TB), I can't decompress the XML data into HDFS and just use the StreamXmlRecordReader that Hadoop provides. Hadoop does support decompressing bz2 files, but it splits the pages arbitrarily and sends those splits to the mapper. Because this is XML, we need the splits to occur at tag boundaries. Is there any way to use Hadoop's built-in bz2 decompression and its stream XML record reader together?

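For context, here is a rough sketch of how StreamXmlRecordReader is typically wired up through the Hadoop Streaming interface on uncompressed input; the paths, mapper script, and the page-tag delimiters are just placeholders for my setup, and the streaming jar location varies by Hadoop version:

    # Works on plain XML: records are cut at <page>...</page> boundaries.
    # Once the input is a .bz2 file, the splits no longer line up with <page> elements.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -inputreader "StreamXmlRecord,begin=<page>,end=</page>" \
        -input /wikipedia/enwiki-pages-articles.xml \
        -output /wikipedia/out \
        -mapper ./my_mapper.py \
        -reducer ./my_reducer.py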
asked Jul 17 '11 by Laurel Orr


1 Answer

The Wikimedia Foundation just released an InputReader for the Hadoop Streaming interface that is able to read the bz2-compressed full dump files and send them to your mappers. The unit being sent to a mapper is not a whole page but two revisions (so you can actually run a diff on the two revisions). This is the initial release and I am sure there will be some bugs, but please give it a spin and help us test it.

This InputReader requires Hadoop 0.21, as Hadoop 0.21 has streaming support for bz2 files. The source code is available at: https://github.com/whym/wikihadoop

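As a rough sketch, launching a streaming job over the .bz2 dump with this InputReader might look like the following; I'm assuming the input format class is the StreamWikiDumpInputFormat shipped in the repo, so check the README for the exact class name, jar name, and flags:

    # The input format handles bz2 decompression and hands each mapper a pair of
    # adjacent revisions, so the mapper can compute a diff between them.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -libjars wikihadoop.jar \
        -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
        -input /wikipedia/enwiki-pages-meta-history.xml.bz2 \
        -output /wikipedia/diffs \
        -mapper ./diff_mapper.py

The mapper is just whatever script consumes a revision pair on stdin; using /bin/cat as the first mapper is a handy way to see what the records look like before writing the real diff logic.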
answered Sep 28 '22 by DrDee