Could anybody please explain what is meant by the indexing process in Hadoop? Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy, do we index the data blocks in Hadoop and store the physical addresses of the blocks in some data structure, so that it takes up additional space in the cluster?
I Googled around this topic but could not find anything satisfactory or detailed. Any pointers will help.
Thanks in advance
Hadoop stores data in files and does not index them. To find something, we have to run a MapReduce job that goes through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high that you cannot easily index changing data.
However, we can still apply indexing in HDFS, in two ways: file-based indexing and InputSplit-based indexing. Let's assume we have two files to store in HDFS for processing. The first is 500 MB and the second is around 250 MB. With a 128 MB split size, we get 4 InputSplits for the first file and 2 InputSplits for the second. We can apply two types of indexing for this case:
1. With file-based indexing, the index can only point at whole files, so you end up reading both files (the full data set here), meaning your indexed query is equivalent to a full-scan query.
2. With InputSplit-based indexing, the index points at individual splits, so a query only reads the splits that actually contain matching data (say, 4 of the 6). The performance should definitely be better than doing a full-scan query.
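To make the split arithmetic concrete, here is a minimal sketch (128 MB is the default block/split size in recent Hadoop versions; `split_count` is a hypothetical helper for illustration, not a Hadoop API):

```python
import math

SPLIT_SIZE_MB = 128  # default HDFS block / split size (assumption for this example)

def split_count(file_size_mb):
    """Number of InputSplits a file of the given size produces."""
    return math.ceil(file_size_mb / SPLIT_SIZE_MB)

print(split_count(500))  # 500 MB -> 4 splits (3 x 128 MB + 1 x 116 MB)
print(split_count(250))  # 250 MB -> 2 splits (128 MB + 122 MB)
```

So with one 500 MB file and one 250 MB file you get 6 splits in total, and a split-level index lets a job skip any of them that cannot contain matching data.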
Now, implementing an InputSplit-based index takes a few steps (building the index, then using it to filter splits at job submission time). For a code sample and other details, refer to this post:
https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/
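The core idea in that post can be sketched as a toy simulation in plain Python (this is not the Hadoop API; the record layout, the `index` structure, and the `query` helper are all assumptions made for illustration). The index maps a key to the set of split IDs that contain it, so a query only reads the matching splits instead of all of them:

```python
from collections import defaultdict

# Toy records spread across numbered "splits": (split_id, key, value)
records = [
    (0, "alice", 1), (0, "bob", 2),
    (1, "carol", 3),
    (2, "alice", 4),
    (3, "dave", 5),
]

# Indexing pass: in Hadoop this would be a MapReduce job that emits,
# for each key, the ID of the InputSplit it was read from.
index = defaultdict(set)
for split_id, key, _ in records:
    index[key].add(split_id)

def query(key):
    """Read only the splits the index points at, instead of a full scan."""
    wanted = index.get(key, set())
    return [v for split_id, k, v in records if split_id in wanted and k == key]

print(query("alice"))  # only splits 0 and 2 are read for this key
print(index["alice"])  # the splits a job filtered by "alice" would process
```

In real Hadoop code (as in the linked post) the filtering happens in a custom InputFormat, which drops the InputSplits the index rules out before the mappers are even scheduled.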