
Indexing process in Hadoop

Tags: hadoop

Could anybody please explain what is meant by the indexing process in Hadoop? Is it something like the traditional indexing of data that we do in an RDBMS, so that, drawing the same analogy, in Hadoop we index the data blocks and store the physical addresses of the blocks in some data structure? That would mean additional space in the cluster.

I googled around this topic but could not find anything satisfactory or detailed. Any pointers will help.

Thanks in advance

Divas asked Dec 16 '25 at 14:12

1 Answer

Hadoop stores data in files and does not index them. To find something, we have to run a MapReduce job that goes through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high that you cannot easily index changing data.
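To make that concrete, a query over un-indexed data boils down to a mapper that reads every record of every InputSplit and filters on the fly. The following is only a minimal sketch; the class name, the search.term configuration key, and the line-oriented input are assumptions for illustration, not part of the original answer.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative full-scan "grep" mapper: without an index, every record of
    // every InputSplit has to be read just to find the matching lines.
    public class FullScanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private String searchTerm;

        @Override
        protected void setup(Context context) {
            // Hypothetical configuration key holding the value we are looking for.
            searchTerm = context.getConfiguration().get("search.term", "");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(searchTerm)) {
                context.write(value, NullWritable.get());
            }
        }
    }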

However, we can add indexing on top of HDFS in two flavours: file-based indexing and InputSplit-based indexing. Let's assume that we have 2 files to store in HDFS for processing. The first one is 500 MB and the second one is around 250 MB. With a 128 MB split size we'll have 4 InputSplits for the first file and 2 InputSplits for the second one. We can apply 2 types of indexing for the mentioned case:

  1. With file-based indexing, you may end up with both files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query.
  2. With InputSplit-based indexing, you may end up with, say, 4 of the 6 InputSplits, so the performance should definitely be better than doing a full-scan query.

Now, to implement InputSplit-based indexing we need to perform the following steps (illustrative sketches follow the list):

  1. Build an index from your full data set - this can be achieved by writing a MapReduce job that extracts the value we want to index and outputs it together with the MD5 hash of its InputSplit (see the mapper sketch below).
  2. Get the InputSplit(s) for the indexed value you are looking for - the output of the MapReduce program will be reduced files (containing the index entries based on InputSplits) which will be stored in HDFS.
  3. Execute your actual MapReduce job on the indexed InputSplits only - Hadoop can do this because it retrieves the InputSplits to be used through the FileInputFormat class. We will create our own IndexFileInputFormat class extending the default FileInputFormat and overriding its getSplits() method. You have to read the file created in the previous step, add all your indexed InputSplits into a list, and then compare this list with the one returned by the superclass. You return to the JobTracker only the InputSplits that were found in your index (see the IndexFileInputFormat sketch below).
  4. In the Driver class we now have to use this IndexFileInputFormat class, setting it as the input format with job.setInputFormatClass(IndexFileInputFormat.class); (see the driver sketch below).
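A minimal sketch of step 1, assuming line-oriented input where the value to index is the first tab-separated field (both the class name and that record layout are illustrative); the mapper emits each value together with the MD5 hash of its InputSplit:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Step 1 (sketch): emit (indexed value, MD5 hash of the current InputSplit) so
    // that the reducer can collect, per value, the set of splits containing it.
    public class IndexBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Text splitId = new Text();

        @Override
        protected void setup(Context context) throws IOException {
            try {
                // Identify the split this mapper is processing by hashing its
                // string representation (file, offset and length).
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(
                        context.getInputSplit().toString().getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));
                }
                splitId.set(hex.toString());
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: the value to index is the first tab-separated field.
            String[] fields = value.toString().split("\t", -1);
            if (fields.length > 0 && !fields[0].isEmpty()) {
                context.write(new Text(fields[0]), splitId);
            }
        }
    }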
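A sketch of step 3 follows. It extends TextInputFormat (a concrete FileInputFormat subclass, so the record reader is already provided) and filters the splits returned by the superclass against the hashes loaded from the index; the index.file.path property and the single-index-file layout are assumptions made for the sketch.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Step 3 (sketch): only return the InputSplits whose MD5 hashes were written
    // to the index file by the index-building job. "index.file.path" is illustrative.
    public class IndexFileInputFormat extends TextInputFormat {

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            // Load the set of indexed split hashes produced in steps 1 and 2.
            Set<String> indexedHashes = new HashSet<>();
            Path indexPath = new Path(job.getConfiguration().get("index.file.path"));
            FileSystem fs = indexPath.getFileSystem(job.getConfiguration());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(indexPath), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    indexedHashes.add(line.trim());
                }
            }

            // Keep only the splits computed by the default implementation whose
            // hash appears in the index; everything else is skipped.
            List<InputSplit> filtered = new ArrayList<>();
            for (InputSplit split : super.getSplits(job)) {
                if (indexedHashes.contains(md5(split.toString()))) {
                    filtered.add(split);
                }
            }
            return filtered;
        }

        private static String md5(String input) throws IOException {
            try {
                MessageDigest digest = MessageDigest.getInstance("MD5");
                StringBuilder hex = new StringBuilder();
                for (byte b : digest.digest(input.getBytes(StandardCharsets.UTF_8))) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e);
            }
        }
    }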
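And a sketch of the Driver class for step 4; the input, output and index paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Step 4 (sketch): the driver wires the custom input format into the actual job.
    public class IndexedQueryDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("index.file.path", args[2]); // index produced in steps 1 and 2

            Job job = Job.getInstance(conf, "indexed query");
            job.setJarByClass(IndexedQueryDriver.class);
            job.setInputFormatClass(IndexFileInputFormat.class); // only indexed splits run
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set the mapper/reducer classes of the actual query here ...

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }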

For a code sample and other details, refer to:

https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/

user8485334 answered Dec 19 '25 at 06:12


