Hadoop input split size vs block size

Tags:

hadoop

mapreduce

I am going through Hadoop: The Definitive Guide, which explains input splits clearly. It says:

Input splits don't contain the actual data; rather, they hold the storage locations of the data on HDFS

and

Usually, the size of an input split is the same as the block size

1) Let's say a 64MB block is on node A and replicated on two other nodes (B, C), and the input split size for the MapReduce program is 64MB. Will this split have the location for node A only, or will it have locations for all three nodes A, B and C?

2) Since the data is local to all three nodes, how does the framework decide which node to run a map task on?

3) How is it handled if the input split size is greater or less than the block size?


asked Jul 18 '13 15:07

rohith

2 Answers

The answer by @user1668782 is a great explanation for the question, and I'll try to give a graphical depiction of it.
Assume we have a 400MB file which consists of 4 records (e.g. a 400MB CSV file with 4 rows of 100MB each).

[Figure: a 400MB file made up of four 100MB records]

If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks. It will look like this:

[Figure: the four records laid out across 128MB HDFS blocks]
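The block layout for this example can also be worked out with a little arithmetic. The following is only a sketch using the sizes from the example (400MB file, 100MB records, 128MB blocks); the helper names ranges and block_layout are made up for illustration and are not part of Hadoop.

```python
# Sizes in MB, taken from the example in this answer.
FILE_SIZE = 400
RECORD_SIZE = 100
BLOCK_SIZE = 128


def ranges(total, chunk):
    """Cut [0, total) into consecutive (start, end) ranges of at most `chunk`."""
    return [(start, min(start + chunk, total)) for start in range(0, total, chunk)]


def block_layout(file_size=FILE_SIZE, record_size=RECORD_SIZE, block_size=BLOCK_SIZE):
    """For each block, list (record number, MB of that record stored in the block)."""
    records = ranges(file_size, record_size)
    layout = []
    for b_start, b_end in ranges(file_size, block_size):
        pieces = []
        for number, (r_start, r_end) in enumerate(records, start=1):
            # Size of the overlap between this block and this record.
            overlap = min(b_end, r_end) - max(b_start, r_start)
            if overlap > 0:
                pieces.append((number, overlap))
        layout.append(pieces)
    return layout


for number, pieces in enumerate(block_layout(), start=1):
    print(f"Block {number}: {pieces}")
```

Each tuple is (record number, MB of that record in the block), so every record except the first is cut by a block boundary.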

Block 1 contains the entire first record and a 28MB chunk of the second record.
If a mapper were run on Block 1 alone, it could not process the second record, since the block does not hold all of it.
This is exactly the problem that input splits solve. Input splits respect logical record boundaries.
Let's assume the input split size is 200MB.

[Figure: the file divided into two 200MB input splits]

Therefore input split 1 should contain both record 1 and record 2. Input split 2 will not start with record 2, since record 2 has already been assigned to input split 1; it will start with record 3.
This is why an input split is only a logical chunk of data. It points to start and end locations within blocks.
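To make the record-to-split assignment concrete, here is a small sketch. It assumes a simplified ownership rule chosen for illustration: a record belongs to the split that contains its first byte, even if the record's tail lies beyond the end of that split. This is a model of the idea, not Hadoop's actual record reader code, and the function names are hypothetical.

```python
def ranges(total, chunk):
    """Cut [0, total) into consecutive (start, end) ranges of at most `chunk`."""
    return [(start, min(start + chunk, total)) for start in range(0, total, chunk)]


def records_per_split(file_size, record_size, split_size):
    """Return, for each split, the numbers of the records it is responsible for.

    Assumed rule: a record is owned by the split that contains its first byte.
    """
    record_starts = range(0, file_size, record_size)
    assignment = []
    for s_start, s_end in ranges(file_size, split_size):
        owned = [
            number
            for number, r_start in enumerate(record_starts, start=1)
            if s_start <= r_start < s_end
        ]
        assignment.append(owned)
    return assignment


# 200MB splits, as in the example: two splits with two records each.
print(records_per_split(400, 100, 200))
# Splits equal to the 128MB block size: record 2 still belongs to split 1 only.
print(records_per_split(400, 100, 128))
```

With 128MB splits, record 2 starts inside split 1, so under this rule split 1 processes it in full and split 2 begins with record 3.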

Hope this helps.


answered Oct 02 '22 06:10

tharindu_DG

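A block is the physical representation of data. A split is the logical representation of the data present in blocks.

Both block size and split size can be changed through configuration properties (in Hadoop 2 and later, for example, dfs.blocksize for the block size, and mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize for the split size).

A map reads data from blocks through splits, i.e. a split acts as a broker between the blocks and the mapper.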

Consider two blocks:

Block 1

aa bb cc dd ee ff gg hh ii jj

Block 2

ww ee yy uu oo ii oo pp kk ll nn

Now the map reads block 1 from aa to jj and doesn't know how to read block 2, i.e. a block doesn't know how to process a different block of information. This is where the split comes in: it forms a logical grouping of Block 1 and Block 2 as a single block, then forms the offset (key) and line (value) using the InputFormat and RecordReader, and sends them to the map for further processing.
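As a rough illustration of the offset (key) and line (value) pairs, the sketch below joins the two example blocks into one logical byte stream and emits one pair per line, keyed by the byte offset at which the line starts. It assumes each block holds exactly one newline-terminated line; the function name read_records is hypothetical and this is not Hadoop code.

```python
def read_records(blocks):
    """Treat the blocks as one logical byte stream and return (offset, line) pairs."""
    data = b"".join(blocks)
    offset = 0
    records = []
    for line in data.splitlines(keepends=True):
        # Key: byte offset of the start of the line. Value: the line without its newline.
        records.append((offset, line.rstrip(b"\n").decode("utf-8")))
        offset += len(line)
    return records


# Assumption for this sketch: each block holds one newline-terminated line.
block_1 = b"aa bb cc dd ee ff gg hh ii jj\n"
block_2 = b"ww ee yy uu oo ii oo pp kk ll nn\n"

for key, value in read_records([block_1, block_2]):
    print(key, value)
```

The mapper only ever sees these key/value pairs; it does not need to know which block a line came from.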

If your resources are limited and you want to limit the number of maps, you can increase the split size. For example, if we have a 640 MB file stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB; a logical grouping of 128 MB is then formed and only 5 maps will be executed, each processing 128 MB.
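The arithmetic behind the 10-versus-5 maps example can be sketched as follows. The split size rule used here, max(minSize, min(maxSize, blockSize)), mirrors how Hadoop's FileInputFormat combines the minimum split size, maximum split size and block size; the Python function names are made up for illustration, and the sketch ignores further details of the real implementation, such as how a small final remainder is handled.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def number_of_splits(file_size, split_size):
    """Number of splits (and so map tasks) for one file, using ceiling division."""
    return (file_size + split_size - 1) // split_size


file_size = 640 * MB
block_size = 64 * MB

# With default settings the split size equals the block size.
default_split = compute_split_size(block_size)
print(number_of_splits(file_size, default_split))

# Raising the minimum split size to 128 MB halves the number of maps.
larger_split = compute_split_size(block_size, min_size=128 * MB)
print(number_of_splits(file_size, larger_split))
```

Raising the minimum above the block size makes splits larger than a block; lowering the maximum below the block size makes them smaller.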

If we specify that the input is not splittable (i.e. splitting is set to false), the whole file will form one input split and be processed by one map, which takes more time when the file is big.

answered Oct 02 '22 08:10

VG P


