Deleting files from HDFS does not free up disk space

Cluster setup

We are running a four-node cluster on physical, dedicated hardware, with some 110 TB of total storage capacity. On April 3, we upgraded the CDH software from version 5.0.0-beta2 to version 5.0.0-1.

We used to put log data on HDFS in plain text format at a rate of approximately 700 GB/day. On April 1 we switched to importing data as .gz files instead, which lowered the daily ingestion rate to about 130 GB.

Since we only want to retain data up to a certain age, there is a nightly job to delete obsolete files. The result of this used to be clearly visible in the HDFS capacity monitoring chart, but can no longer be seen.

Since we import about 570 GB less data than we delete every day, one would expect the capacity used to go down. Instead, our reported HDFS usage has been growing steadily since the cluster software was upgraded.

Problem description

Running hadoop fs -du -h / gives the following output:

0       /system
1.3 T   /tmp
24.3 T  /user

This is consistent with what we expect to see, given the size of the imported files. Using a replication factor of 3, this should correspond to a physical disk usage of about 76.8 TB.
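The arithmetic behind that estimate can be checked with a short sketch. The directory sizes and the replication factor are taken from the question; the variable names are only illustrative.

```python
# Logical sizes reported by the du output above, in TB.
logical_tb = {"/system": 0.0, "/tmp": 1.3, "/user": 24.3}

# Replication factor stated in the question.
replication_factor = 3

# Each logical byte is stored replication_factor times on disk.
total_logical_tb = sum(logical_tb.values())
expected_physical_tb = total_logical_tb * replication_factor

print(f"logical total:     {total_logical_tb:.1f} TB")
print(f"expected physical: {expected_physical_tb:.1f} TB")
```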

When instead running hdfs dfsadmin -report the result is different:

Configured Capacity: 125179101388800 (113.85 TB)
Present Capacity: 119134820995005 (108.35 TB)
DFS Remaining: 10020134191104 (9.11 TB)
DFS Used: 109114686803901 (99.24 TB)
DFS Used%: 91.59%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Here, DFS Used is reported as 99.24 TB, which is what we see in the monitoring chart. Where did all that data come from?
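To put a number on the gap, the report can be parsed and compared against the expected figure. This is a rough sketch: the report text is pasted from the output above, `parse_dfsadmin_report` is a hypothetical helper (not part of Hadoop), and it assumes the "TB" shown in the report means 2^40 bytes, which matches the byte counts printed next to it.

```python
import re

# Report text copied from the question.
REPORT = """\
Configured Capacity: 125179101388800 (113.85 TB)
Present Capacity: 119134820995005 (108.35 TB)
DFS Remaining: 10020134191104 (9.11 TB)
DFS Used: 109114686803901 (99.24 TB)
DFS Used%: 91.59%
"""

def parse_dfsadmin_report(text):
    """Hypothetical helper: map each summary field to its byte count."""
    pattern = re.compile(
        r"^(Configured Capacity|Present Capacity|DFS Remaining|DFS Used): (\d+)",
        re.MULTILINE,
    )
    return {name: int(value) for name, value in pattern.findall(text)}

report = parse_dfsadmin_report(REPORT)

# Assumption: the report's "TB" is 2**40 bytes.
used_tb = report["DFS Used"] / 2**40

# Expected physical usage: 25.6 TB logical times replication factor 3.
expected_tb = 25.6 * 3
unaccounted_tb = used_tb - expected_tb

print(f"DFS Used:    {used_tb:.2f} TB")
print(f"expected:    {expected_tb:.2f} TB")
print(f"unaccounted: {unaccounted_tb:.2f} TB")
```

Under these assumptions, roughly 22 TB of reported usage is not explained by the files visible in the namespace.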

What we have tried

The first thing we suspected was that the automatic emptying of trash was not working, but that does not seem to be the case. Only the most recently deleted files are in trash, and they automatically disappear after a day.

Our issue seems very similar to what would happen if an HDFS metadata upgrade was performed but not finalized. I don't think this is needed when upgrading between these versions, but I have still performed both steps 'just in case'.

On the DN storage volumes in the local file system, there is a lot of data under 'previous/finalized'. I know too little about the implementation details of HDFS to tell whether this is significant, but it could indicate that something in the finalization is out of sync.

We will soon run out of disk space on the cluster, so any help is much appreciated.

asked Apr 14 '14 10:04

knutn

1 Answer

I found a similar issue on our cluster, which probably stemmed from a failed upgrade.

First, make sure to finalize the upgrade on the namenode:

hdfs dfsadmin -finalizeUpgrade

What I found was that the datanodes for some reason did not finalize their directories at all.

On your datanode, you should see the following directory layout:

/{mountpoint}/dfs/dn/current/{blockpool}/current

and

/{mountpoint}/dfs/dn/current/{blockpool}/previous

If you have not finalized, the previous directory contains all data that was created before the upgrade. Deleting anything in HDFS will not remove it from there, hence your storage usage never goes down.
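One way to see how much space is held by unfinalized data is to total the size of each previous directory on a datanode. The following is a minimal sketch, assuming the directory layout described above; `find_previous_dirs` and `dir_size_bytes` are hypothetical helpers, and the demo runs against a throwaway directory tree with a made-up block pool name rather than a real datanode.

```python
import os
import tempfile
from pathlib import Path

def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def find_previous_dirs(dn_data_dir):
    """Return {path: size} for each <dn_data_dir>/current/<blockpool>/previous."""
    base = Path(dn_data_dir) / "current"
    return {
        str(p): dir_size_bytes(p)
        for p in sorted(base.glob("*/previous"))
        if p.is_dir()
    }

# Demo on a temporary tree that mimics the layout; "BP-example" is made up.
with tempfile.TemporaryDirectory() as tmp:
    dn_dir = Path(tmp) / "dfs" / "dn"
    block_pool = dn_dir / "current" / "BP-example"
    (block_pool / "current").mkdir(parents=True)
    (block_pool / "previous" / "finalized").mkdir(parents=True)
    (block_pool / "previous" / "finalized" / "blk_1").write_bytes(b"x" * 1024)

    demo_result = find_previous_dirs(dn_dir)
    for path, size in demo_result.items():
        print(f"{path}: {size} bytes")
```

On a real datanode, the argument would be the data directory under each mount point (for example the dfs/dn directory from the layout above), and the script would need read access to it.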

Actually, the simplest solution was sufficient:

Restart the namenode

Watch the datanode log; you should see something like this:

INFO org.apache.hadoop.hdfs.server.common.Storage: Finalizing upgrade for storage directory

Afterwards the directories will be cleared in the background and the storage reclaimed.

answered Oct 03 '22 06:10

Joey


Tags:

hadoop

hdfs

cloudera-cdh