
Optimal block size in HDFS - Can large block sizes hurt

Tags:

hadoop

hdfs

I understand the disadvantages of small files and small block sizes in HDFS. I'm trying to understand the rationale behind the default 64/128 MB block size. Are there any drawbacks to having a large block size, say 2 GB? (I've read that values larger than that cause issues, the details of which I haven't dug into yet.)
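
For reference, here is how I check what default my cluster is actually using - either with hdfs getconf -confKey dfs.blocksize from the shell, or with a minimal sketch against the Java FileSystem API (the class name is mine, and it assumes the Hadoop client jars and config are on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowDefaultBlockSize {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Block size the namenode will use for new files under this path
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + blockSize / (1024 * 1024) + " MB");
        }
    }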

Issues I see with block sizes that are too large (please correct me - maybe some or all of these issues don't really exist):

  1. Possibly, there could be issues with re-replicating a 1 GB block when a data node goes down, since that requires the cluster to transfer the whole block in one piece. This only seems to be a problem when considering a single block - with a smaller block size, say 128 MB, we would have to transfer many more smaller blocks instead (which, I think, involves more overhead).

  2. It could trouble the mappers: each mapper ends up with one large block, reducing the possible number of mappers. But this shouldn't be an issue if we use a smaller split size? (See the sketch after this list.)

  3. This one sounded silly when it occurred to me, but I'll throw it in anyway - since the namenode does not know the size of a file beforehand, it might consider a data node unavailable because that node does not have enough free disk space for a new block (given a large block size of maybe 1-2 GB). But maybe it handles this smartly by just cutting down the block size of that particular block (which is probably a bad solution anyway).
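
Regarding point 2, this is what I mean by using a smaller split size - a minimal sketch with the new MapReduce API (the input path is just illustrative, and the mapper/output setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Illustrative input path; any large file works
            FileInputFormat.addInputPath(job, new Path("/data/big-file"));

            // Cap each input split at 256 MB: even if the file's blocks are 1-2 GB,
            // one block is carved into several splits, i.e. several map tasks
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }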

Block size probably depends on the use case. I basically want to find an answer to the question - is there a situation/use case where a large block size setup can hurt?
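
On the use-case point: as far as I understand, the block size is not a cluster-wide constant - it can be chosen per file at write time, e.g. with something like hdfs dfs -D dfs.blocksize=268435456 -put ... from the shell, or via the Java client. A minimal sketch, with an illustrative path and size:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path out = new Path("/tmp/blocksize-demo.bin");   // illustrative path
            long blockSize = 256L * 1024 * 1024;              // 256 MB for this file only
            short replication = 3;
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            // FileSystem.create lets the writer pick the block size per file,
            // overriding the cluster default (dfs.blocksize)
            try (FSDataOutputStream stream =
                     fs.create(out, true, bufferSize, replication, blockSize)) {
                stream.writeUTF("hello");
            }
        }
    }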

Any help is appreciated. Thanks in advance.

asked Jan 22 '14 by Praneeth


People also ask

What happens if we increase block size in Hadoop?

When the block size is small, seek overhead increases: the data, once divided into blocks, is spread over a larger number of blocks, and more blocks means more seeks to read or write the data.

Is larger block size better?

A larger block size is often beneficial for large sequential read and write workloads. A smaller block size is likely to offer better performance for small file, small random read and write, and metadata-intensive workloads.

What are the main reasons for using a large block size?

The main reason for large blocks is to minimize the cost of seeks: with a large block, the time taken to transfer the data from disk can be made significantly longer than the time taken to seek to the start of the block. As a result, a large file made up of multiple blocks is transferred at close to the disk transfer rate.
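
To make that concrete with the usual back-of-the-envelope numbers (illustrative, not measured): if a seek costs about 10 ms and the disk transfers at about 100 MB/s, reading a 128 MB block spends roughly 1.3 s transferring versus 10 ms seeking, so seek overhead stays under 1%. If the same 128 MB were stored in 1 MB blocks, reading it would need on the order of 128 seeks - about 1.3 s of seeking on top of 1.3 s of transfer - roughly doubling the read time.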

How does block size affect performance?

A larger block size results in fewer cached blocks. For a file system that contains files of many different sizes, selecting a larger block size, 4 MiB or greater, rather than a smaller one delivers better overall performance.


1 Answer

I did extensive performance validation of high-end Hadoop clusters, and we varied the block size from 64 MB up to 2 GB. To answer the question: imagine workloads in which smallish files - say tens of MB - often need to be processed. Which block size do you think will be more performant in that case: 64 MB or 1024 MB?

For the case of large files, yes - larger block sizes tend towards better performance, since the per-mapper overhead is not negligible.
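
If you want to see how a particular file actually got laid out, the block size it was written with and the location of each block are both queryable - hdfs fsck /path -files -blocks -locations shows it from the shell, and a minimal sketch with the FileSystem API (the path is illustrative) looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InspectBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/big-file")); // illustrative

            // Block size recorded for this file (fixed when it was written)
            System.out.println("Block size: " + status.getBlockSize() / (1024 * 1024) + " MB");

            // One BlockLocation per block; a small file occupies a single partial block,
            // so a large block size does not waste disk space on it
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println("Blocks: " + blocks.length);
            for (BlockLocation b : blocks) {
                System.out.println("  offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }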

answered Oct 01 '22 by WestCoastProjects