Can hadoop fs -ls be used to find all directories older than N days (from the current date)?
I am trying to write a cleanup routine to find and delete all directories on HDFS (matching a pattern) that were created more than N days before the current date.
Here is a small script to list directories older than 10 days. The hadoop fs -ls -R command lists all the files and directories in HDFS, and grep "^d" keeps only the directories. A while..do loop then goes through each directory.
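The script itself is not shown above, so here is a minimal sketch of the approach it describes. The function name list_old_dirs and the sample paths in the usage line are made up for illustration, and GNU date is assumed. Note that hadoop fs -ls reports the modification time, not the creation time, so "older" here means "not modified for more than N days".

```shell
#!/bin/bash
# Hypothetical helper: reads `hadoop fs -ls -R` style lines on stdin and
# prints every directory last modified more than $1 days ago.
list_old_dirs() {
  local days="$1" now perms repl owner group size mdate mtime path dir_secs age
  now=$(date +%s)
  # grep "^d" keeps only directory entries (permissions start with "d").
  grep "^d" | while read -r perms repl owner group size mdate mtime path; do
    # Convert the listed date and time to epoch seconds (GNU date syntax).
    dir_secs=$(date --date "$mdate $mtime" +%s)
    age=$(( (now - dir_secs) / 86400 ))
    if [ "$age" -gt "$days" ]; then
      echo "$path"
    fi
  done
}

# Example (needs an HDFS client, so it is left commented out):
#   hadoop fs -ls -R /some/path | list_old_dirs 10
```

Because the last variable given to read receives the rest of the line, directory names containing spaces are printed whole.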
If you type hdfs dfs -ls / you will get a list of the directories in HDFS. You can then copy files from the local file system into a particular HDFS directory with -copyFromLocal or -put, or create a new directory with -mkdir.
expunge: This command empties the HDFS trash.
You will find an rm command among the hadoop fs commands. It is similar to the Linux rm command and removes a file from HDFS. To delete recursively, use -rm -r; the older -rmr form does the same thing but is deprecated.
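As a rough illustration of how the rm command could be driven from a list of paths, here is a small sketch. The function name delete_paths and the DRY_RUN variable are assumptions made for this example, not part of Hadoop. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Hypothetical helper: reads HDFS paths from stdin, one per line.
# DRY_RUN defaults to 1, in which case the commands are only printed.
delete_paths() {
  local path
  while IFS= read -r path; do
    # Skip blank lines so rm is never called without a path.
    [ -n "$path" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "hdfs dfs -rm -r $path"
    else
      # Redirect stdin so the command cannot consume the list of paths.
      hdfs dfs -rm -r "$path" < /dev/null
    fi
  done
}

# Example (the path is a placeholder):
#   echo /tmp/some/old/dir | DRY_RUN=0 delete_paths
```

Without -skipTrash, deleted paths go to the trash when it is enabled, and hdfs dfs -expunge can be used afterwards to empty it.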
hdfs dfs -ls /hadoop/path/*.txt | awk '$6 < "2017-10-24"'
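The one-liner above hardcodes the cutoff date. A possible way to derive it from a number of days is sketched below. The function name cutoff_date is invented for this example, and it relies on GNU date, since the --date option is not available in BSD or macOS date.

```shell
#!/bin/bash
# Hypothetical helper: print the date N days before today as YYYY-MM-DD.
# Assumes GNU date.
cutoff_date() {
  date --date "$1 days ago" "+%F"
}

# Example (needs an HDFS client, so it is left commented out):
#   hdfs dfs -ls '/hadoop/path/*.txt' | awk -v cutoff="$(cutoff_date 10)" '$6 < cutoff'
```

Dates in YYYY-MM-DD form sort the same way as strings and as calendar dates, which is why the plain awk string comparison works.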
I didn't have the HdfsFindTool, nor the fsimage from curl, and I didn't much like the ls to grep with while loop using date, awk and hadoop and awk again. But I appreciated the answers.
I felt like it could be done with just one ls, one awk, and maybe an xargs.
I also added options to list the files or summarize them before choosing to delete them, as well as to choose a specific directory. Lastly, I leave the directories alone and only concern myself with the files.
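#!/bin/bash
# List, size up, or delete HDFS files older than N days under a path.
USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"

# The age in days is required.
if [ -z "$1" ]; then
    echo "$USAGE"
    exit 1
fi

# Cutoff as "YYYY-MM-DD HH:MM", comparable to the date and time columns
# of the ls output (needs GNU date).
AGO="$(date --date "$1 days ago" "+%F %R")"
echo "# Will search for files older than $AGO"

# The action is required as well.
if [ -z "$2" ]; then
    echo "$USAGE"
    exit 1
fi

INPATH="${3:-/tmp/hive}"
echo "# Will search under $INPATH"

case $2 in
    list)
        # Print the ls lines of non-directory entries older than the cutoff.
        hdfs dfs -ls -R "$INPATH" |
            awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago'
        ;;
    size)
        # Count the matching files and add up their sizes (column 5).
        hdfs dfs -ls -R "$INPATH" |
            awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {
                sum += $5; cnt += 1
            } END {
                print cnt + 0, "Files with total", sum + 0, "Bytes"
            }'
        ;;
    delete)
        # Column 8 is the path, so paths containing spaces are not handled.
        # xargs -r (GNU) skips the rm call when nothing matches.
        hdfs dfs -ls -R "$INPATH" |
            awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {print $8}' |
            xargs -r hdfs dfs -rm -skipTrash
        ;;
    *)
        echo "$USAGE"
        exit 1
        ;;
esac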
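One limitation of picking out column 8 with awk is that a path containing spaces gets cut at the first space. The sketch below shows one possible way around that, by stripping the first seven columns and keeping the rest of the line. The function name files_older_than is made up for illustration, and the cutoff is expected in the same "YYYY-MM-DD HH:MM" form as the AGO variable in the script.

```shell
#!/bin/bash
# Hypothetical helper: reads `hdfs dfs -ls -R` style lines on stdin and
# prints the full path of every non-directory entry older than the cutoff.
# Usage: files_older_than "YYYY-MM-DD HH:MM"
files_older_than() {
  awk -v cutoff="$1" '
    $1 !~ /^d/ && NF >= 8 && ($6 " " $7) < cutoff {
      line = $0
      # Remove the first seven whitespace-separated columns one by one,
      # so whatever remains is the path, spaces included.
      for (i = 1; i <= 7; i++) sub(/^[ \t]*[^ \t]+/, "", line)
      sub(/^[ \t]+/, "", line)
      print line
    }'
}

# Example (needs an HDFS client, so it is left commented out):
#   hdfs dfs -ls -R /tmp/hive | files_older_than "$(date --date '10 days ago' '+%F %R')"
```

The printed paths would still need careful quoting when passed on to a delete command, for example by reading them line by line instead of using plain xargs.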
I hope others find this useful.