 

Behavior of the parameter "mapred.min.split.size" in HDFS

Tags:

hadoop

hdfs

Does the parameter "mapred.min.split.size" change the size of the blocks in which the file was written earlier? Suppose that, when starting my job, I pass the parameter "mapred.min.split.size" with a value of 134217728 (128 MB). Which of the following is correct about what happens?

1 - Each map processes the equivalent of 2 HDFS blocks (assuming each block is 64 MB);

2 - My input file (previously loaded into HDFS) will be split again so that it occupies 128 MB blocks in HDFS;
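For concreteness, here is a minimal sketch of how the parameter might be passed when submitting the job (classic mapred API; the class name and paths are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJob.class);
        conf.setJobName("split-size-test");

        // Request a minimum split size of 128 MB (134217728 bytes)
        conf.set("mapred.min.split.size", "134217728");

        FileInputFormat.setInputPaths(conf, new Path("/user/alexandre/input"));   // illustrative path
        FileOutputFormat.setOutputPath(conf, new Path("/user/alexandre/output")); // illustrative path

        JobClient.runJob(conf);
    }
}
```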

Alexandre asked Oct 04 '13


People also ask

What is mapred.min.split.size?

If mapred.min.split.size is set to 128 MB, the size of the InputSplit will be 128 MB even though the block size is 64 MB. It is not recommended to make the split size greater than the block size.

What is the default split size in Hadoop?

By default, the split size is approximately equal to the HDFS block size (128 MB). In a MapReduce program, the InputSplit is user-defined, so the user can control the split size based on the size of the data.

How is input split size calculated in Hadoop?

So, while storing 1 GB of data in HDFS, Hadoop will split this data into smaller chunks. Say the system has a default split size of 128 MB. Then Hadoop will store the 1 GB of data in 8 blocks (1024 / 128 = 8). So, to process these 8 blocks, i.e. 1 GB of data, 8 mappers are required.
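A quick check of that arithmetic in plain Java (the numbers are the ones from the snippet above):

```java
public class BlockCountCheck {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024; // 1 GB in bytes
        long splitSize = 128L * 1024 * 1024;  // 128 MB split size in bytes

        // Ceiling division: a partially filled last split still needs a mapper
        long numSplits = (fileSize + splitSize - 1) / splitSize;

        System.out.println("Splits/mappers required: " + numSplits); // prints 8
    }
}
```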


2 Answers

The split size is calculated by the formula:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case it will be:

split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB
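This mirrors the computeSplitSize() helper in Hadoop's new-API FileInputFormat. A minimal sketch plugging in the question's numbers (64 MB blocks, 128 MB minimum, default maximum):

```java
public class SplitSizeDemo {
    // Same logic as FileInputFormat.computeSplitSize():
    // max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // dfs.block.size = 64 MB
        long minSize   = 128L * 1024 * 1024; // mapred.min.split.size = 128 MB
        long maxSize   = Long.MAX_VALUE;     // mapred.max.split.size default

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println(splitSize / (1024 * 1024) + " MB"); // prints 128 MB
    }
}
```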

So, for the two statements in the question:

  1. Each map will process 2 HDFS blocks (assuming each block is 64 MB): True

  2. There will be a new division of my input file (previously loaded into HDFS) to occupy 128 MB blocks in HDFS: False

Making the minimum split size greater than the block size increases the split size, but at the cost of data locality.
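If you want to experiment with this trade-off, the newer mapreduce API exposes a setter for the same knob. A sketch, with the rest of the job setup assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MinSplitExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-split-example");

        // Equivalent to setting mapred.min.split.size (the property is named
        // mapreduce.input.fileinputformat.split.minsize in newer releases)
        FileInputFormat.setMinInputSplitSize(job, 134217728L); // 128 MB

        // ... set mapper/reducer, input/output paths, then submit as usual
    }
}
```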

Ankit Singhal answered Sep 30 '22


Assume that the minimum split size is defined as 128 MB and the block size is defined as 64 MB.

NOTE: By default, HDFS replicates each block to 3 different datanodes. Also, each map task normally performs its operation on a single block.

Hence, with a 128 MB split size, 2 blocks will be treated as a single split and a single map task will be created for it, running on a single datanode. This happens at the cost of data locality. By "cost of data locality" I mean that the block residing on a datanode where the map task is not running has to be fetched from that datanode and processed on the datanode where the map task is running, resulting in lower performance.

However, if we consider a file of size 128 MB with the default block size of 64 MB and the default minimum split size of 64 MB, then, as normally happens, two map tasks will be created, one for each 64 MB block.
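The two cases side by side, as a small self-contained sketch (the split-size formula is the one from the answer above):

```java
public class MapTaskCount {
    // max(minSize, min(maxSize, blockSize)), as in FileInputFormat
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long mapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize  = 128L * 1024 * 1024; // 128 MB file
        long blockSize = 64L * 1024 * 1024;  // 64 MB blocks

        // Case 1: min split size raised to 128 MB -> one map task spanning 2 blocks
        long split1 = computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE);
        System.out.println("min=128MB -> " + mapTasks(fileSize, split1) + " map task(s)"); // 1

        // Case 2: defaults (min split = block size) -> two map tasks, one per block
        long split2 = computeSplitSize(blockSize, blockSize, Long.MAX_VALUE);
        System.out.println("defaults -> " + mapTasks(fileSize, split2) + " map task(s)"); // 2
    }
}
```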

Baban Gaigole answered Sep 30 '22