Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?
Also, do all the mappers run simultaneously, or might some of them run serially?
When Hadoop submits a job, it splits the input data logically into input splits, and each split is processed by one Mapper. The number of Mappers is equal to the number of input splits created. InputFormat.getSplits() is responsible for generating the input splits, and each split becomes the input of one map task.
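For reference, here is a minimal sketch (using the newer org.apache.hadoop.mapreduce API; the input path is a hypothetical placeholder) of how the split sizes that getSplits() produces can be bounded from the driver:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);

        // Hypothetical input path; replace with your own.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Bound the split size: getSplits() will not produce splits larger
        // than the max split size or smaller than the min split size.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

        // One map task is launched per split that getSplits() returns.
    }
}
```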
What is InputSplit in Hadoop? InputSplit in Hadoop MapReduce is the logical representation of the data. It describes the unit of work processed by a single map task in a MapReduce program; each InputSplit is the portion of the data handled by an individual Mapper.
Because HDFS blocks are cut at fixed byte offsets, a record can span two blocks. To handle this, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job is submitted by the client, the framework calculates the total number of input splits and works out where the first record in a block starts and where the last record in the block finishes.
How is the splitting of the file invoked in Apache Hadoop? The input file for processing is stored on HDFS. The InputFormat component of the MapReduce job divides this file into splits, called InputSplits in Hadoop MapReduce.
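If you want to see those splits yourself, here is a small sketch (again with a hypothetical input path, new mapreduce API) that asks TextInputFormat for the splits it would hand to mappers:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ListSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "list-splits");
        // Hypothetical input path passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The logical splits getSplits() would produce; one map task per split.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Splits (= map tasks): " + splits.size());
        for (InputSplit split : splits) {
            System.out.println(split + "  length=" + split.getLength());
        }
    }
}
```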
I just ran a sample MR program based on your question, and here are my findings.
Input: a file smaller than the block size.
Case 1: Number of mappers = 1. Result: 1 map task launched. The input split size for the single mapper is the same as the input file size.
Case 2: Number of mappers = 5. Result: 5 map tasks launched. The input split size for each mapper is one fifth of the input file size.
Case 3: Number of mappers = 10. Result: 10 map tasks launched. The input split size for each mapper is one tenth of the input file size.
So, based on the above, for a file smaller than the block size:
split size = total input file size / number of map tasks launched
Note: keep in mind that the number of map tasks is ultimately decided based on the input splits (see the sketch below).
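For what it's worth, my test used the old org.apache.hadoop.mapred API, where the requested number of map tasks is passed to FileInputFormat.getSplits() as a hint (goalSize = totalSize / numSplits). A rough sketch of the driver, with hypothetical input/output paths:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapperCountExperiment {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapperCountExperiment.class);
        conf.setJobName("mapper-count-experiment");

        // Hypothetical paths used for the experiment.
        FileInputFormat.setInputPaths(conf, new Path("/data/small-file"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

        // Hint: ask for 5 map tasks. For a file smaller than one block, the
        // old-API split computation ends up with roughly totalSize / 5 per
        // split, so 5 splits and therefore 5 map tasks are launched.
        conf.setNumMapTasks(5);

        JobClient.runJob(conf);
    }
}
```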