HDFS block size Vs actual file size

Tags: filesize, hdfs

I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Let's say that I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks of available storage.

If I create a small file of, say, 12.8 MB, the number of available HDFS blocks will become 79. What happens if I create another small file of 12.8 MB? Will the number of available blocks stay at 79, or will it come down to 78? In the former case, HDFS basically recalculates the number of available blocks after each block allocation based on the free disk space, so the number of available blocks will become 78 only after more than 128 MB of disk space is consumed. Please clarify.

224

asked Feb 25 '13 08:02

NPE

1 Answer

The best way to know is to try it; see my results below.

But before trying, my guess is that even if you can only allocate 80 full blocks in your configuration, you can allocate more than 80 non-empty files. This is because I think HDFS does not use a full block each time you allocate a non-empty file. Put another way, HDFS blocks are not a storage allocation unit but a replication unit. I think the storage allocation unit of HDFS is that of the underlying filesystem: if you use ext4 with a block size of 4 KB and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 times 4 KB = 12 KB of hard disk space.

Enough guessing and thinking; let's try it. My lab configuration is as follows:
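The guess above can be written as a small calculation. This is only a sketch: the function name `disk_usage_bytes` is hypothetical (it is not a Hadoop command), and it assumes each replica is rounded up to whole blocks of the local filesystem while ignoring the small checksum files that data nodes store next to each block.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate raw disk bytes consumed by one HDFS file.
# Assumption: each replica occupies whole local-filesystem blocks.
# Assumption: checksum (.meta) files on the data nodes are ignored.
disk_usage_bytes() {
  local file_bytes=$1 fs_block_bytes=$2 replication=$3
  # Ceiling division: local filesystem blocks needed per replica.
  local fs_blocks=$(( (file_bytes + fs_block_bytes - 1) / fs_block_bytes ))
  echo $(( fs_blocks * fs_block_bytes * replication ))
}

# 1 KB file, 4 KB ext4 blocks, replication factor 3
disk_usage_bytes 1024 4096 3   # prints 12288, i.e. 12 KB
```

Note that the HDFS block size does not appear in the calculation at all, which is the point of the guess: under this model it only determines how a large file is split, not how much space a small file takes.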

hadoop version 1.0.4
4 data nodes, each with a little less than 5.0G of available space, ext4 block size of 4K
block size of 64 MB, default replication of 1

After starting HDFS, I have the following NameNode summary:

1 files and directories, 0 blocks = 1 total
DFS Used: 112 KB
DFS Remaining: 19.82 GB

Then I run the following commands:

hadoop fs -mkdir /test
for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done

With these results:

12 files and directories, 10 blocks = 22 total
DFS Used: 122.15 KB
DFS Remaining: 19.82 GB

So the 10 files did not consume 10 times 64 MB: "DFS Used" grew by only about 10 KB (from 112 KB to 122.15 KB), and "DFS Remaining" did not change.
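To put the two numbers side by side, here is a quick sketch of the arithmetic using the figures from the NameNode summaries above. The variable names are mine, and it assumes `awk` is available for the decimal subtraction.

```shell
#!/usr/bin/env bash
# Figures copied from the NameNode summaries before and after the copy.
used_before_kb=112
used_after_kb=122.15
files=10
block_mb=64

# Observed growth of "DFS Used" (awk handles the decimal value).
observed_kb=$(awk -v a="$used_after_kb" -v b="$used_before_kb" \
  'BEGIN { printf "%.2f", a - b }')

# What the growth would be if every file consumed a whole HDFS block.
full_block_mb=$(( files * block_mb ))

echo "observed growth: ${observed_kb} KB"
echo "if each file took a whole block: ${full_block_mb} MB"
```

The observed growth is roughly 1 KB per file, nowhere near the 640 MB that full-block allocation would require.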

151

answered Oct 30 '22 22:10

jfg956
