First off, I'm new to hadoop :)
I have a large data set of gzipped files (TBs of documents, in gzipped files of roughly 100-500 MB each).
Basically, I need some sort of filtering of my input to my map-reduce jobs.
I want to analyze these files in various ways. Many of these jobs only need to analyze files of a certain format (of a certain length, containing certain words, etc. - all sorts of arbitrary (inverted) indexes), and it takes unreasonably long to process the entire dataset for each job. So I want to create indexes that point to specific blocks/files in HDFS.
I can generate the required indices manually, but how do I specify exactly which (thousands of) specific files/blocks I want to process as input to mappers? Can I do this without reading the source data into e.g. HBase? Do I want to? Or am I tackling this problem completely wrong?
Assuming you have some way of knowing which x files to process out of a large corpus, you can use the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPathFilter(Job, Class<? extends PathFilter>) method when configuring your job.

You'll need to pass a class that implements PathFilter. Hadoop will create a new instance of this class, and each file in the corpus will be presented to it via the boolean accept(Path path) method. You can then use this to filter down the files to actually run map tasks against (whether that's based on the filename, size, last-modified timestamp, etc.).
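As a minimal sketch (the class name IndexedPathFilter and the way the index is loaded are made up for illustration; wire it up however suits your setup):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Hypothetical filter: only accept files whose names appear in a
    // pre-built index. How you load that index (distributed cache, a
    // sidecar file on HDFS, a database lookup) is up to you.
    public class IndexedPathFilter implements PathFilter {

        // Placeholder for the file names your index says are relevant.
        private static final Set<String> WANTED = new HashSet<>();

        @Override
        public boolean accept(Path path) {
            String name = path.getName();
            // Accept anything that isn't a data file (e.g. directories),
            // otherwise Hadoop won't descend into your input directories.
            if (!name.endsWith(".gz")) {
                return true;
            }
            return WANTED.contains(name);
        }
    }

Then register it when setting up the job:

    Job job = Job.getInstance(new Configuration(), "filtered-analysis");
    FileInputFormat.addInputPath(job, new Path("/data/corpus"));
    FileInputFormat.setInputPathFilter(job, IndexedPathFilter.class);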
To target specific blocks, you'll need to implement your own extension of FileInputFormat, specifically overriding the getSplits method. getSplits uses the listStatus method to determine which input files to process (and that is where the previously mentioned PathFilter is invoked), after which it determines how to break those files up into splits (if the files are splittable). So in your getSplits override you'll again need to use your reference data to target the specific splits you're interested in.
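A minimal sketch of that idea, using the new mapreduce API (the class name IndexedInputFormat and the splitIsWanted lookup are placeholders for your own index logic):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Let the parent class compute the usual splits, then drop any split
    // your index says contains nothing of interest.
    public class IndexedInputFormat extends TextInputFormat {

        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException {
            List<InputSplit> wanted = new ArrayList<>();
            for (InputSplit split : super.getSplits(context)) {
                FileSplit fileSplit = (FileSplit) split;
                if (splitIsWanted(fileSplit.getPath().toString(),
                                  fileSplit.getStart(), fileSplit.getLength())) {
                    wanted.add(split);
                }
            }
            return wanted;
        }

        private boolean splitIsWanted(String path, long start, long length) {
            // Consult your reference data / index here.
            return true;
        }
    }

You'd then set job.setInputFormatClass(IndexedInputFormat.class). Note that since gzipped files aren't splittable, each .gz file comes back from super.getSplits() as a single split, so this effectively filters at the file level; to target sub-file ranges you'd need a splittable format (e.g. block-compressed SequenceFiles).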
As for storing and retrieving this target file / split information, you have several choices of persistence store, such as a key/value store (HBase, as you noted in your question), a separate database (MySQL, etc.), an inverted index (Lucene), and so on.
Because you'd like to filter input based on file content (e.g. files containing the word foobar) and not file metadata (file name, size, etc.), you would actually need the kind of indexes I created based on Hadoop InputSplit. See my blog.