
Finding directories older than N days in HDFS

Tags:

hadoop

hdfs

Can hadoop fs -ls be used to find all directories older than N days (from the current date)?

I am trying to write a cleanup routine that finds and deletes all directories on HDFS (matching a pattern) that were created more than N days before the current date.

vid12 asked Sep 27 '12 03:09


2 Answers

hdfs dfs -ls '/hadoop/path/*.txt' | awk '$6 < "2017-10-24"'
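
To avoid hard-coding the date and to keep only directories, something along these lines might work. It is a minimal sketch, assuming GNU date and the usual eight-column hdfs dfs -ls layout (date in field 6, path in field 8); older_dirs and /hadoop/path are illustrative names, not part of any tool. Keep in mind that the date printed by -ls is the modification time, not the creation time.

```shell
#!/bin/bash
# Hypothetical helper: reads `hdfs dfs -ls` style lines on stdin and prints
# the paths of directories whose date (field 6) is before the cutoff.
older_dirs() {
  local cutoff="$1"   # expected as YYYY-MM-DD
  # $1 ~ /^d/ keeps directories only; ($6 "") forces a string comparison,
  # which orders ISO dates correctly.
  awk -v cutoff="$cutoff" '$1 ~ /^d/ && ($6 "") < cutoff {print $8}'
}

# Example invocation (assumes GNU date; /hadoop/path is a placeholder):
#   hdfs dfs -ls /hadoop/path | older_dirs "$(date --date '10 days ago' +%F)"
```

Since the comparison is on strings, the cutoff has to be in the same YYYY-MM-DD form that -ls prints.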

Amol Kulkarni answered Nov 07 '22 21:11


I didn't have HdfsFindTool available, nor the fsimage fetched with curl, and I didn't much like the approach of piping ls into grep and a while loop that calls date, awk, hadoop, and awk again. I appreciated the other answers, though.

I felt like it could be done with just one ls, one awk, and maybe an xargs.

I also added options to list the files or summarize their total size before choosing to delete them, and to choose a specific directory. Lastly, I leave directories alone and only concern myself with files.

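#!/bin/bash
# List, total, or delete HDFS files older than N days.
# Needs GNU date (--date) and GNU xargs (-r).
USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"
if [ ! "$1" ]; then
  echo "$USAGE"
  exit 1
fi
# Cutoff in the same "YYYY-MM-DD HH:MM" form that hdfs dfs -ls prints,
# so a plain string comparison orders the timestamps correctly.
AGO="$(date --date "$1 days ago" "+%F %R")"

echo "# Will search for files older than $AGO"
if [ ! "$2" ]; then
  echo "$USAGE"
  exit 1
fi
INPATH="${3:-/tmp/hive}"

echo "# Will search under $INPATH"
case $2 in
  list)
    # Skip directories (permission string starting with d); print matching lines.
    hdfs dfs -ls -R "$INPATH" |
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago'
  ;;
  size)
    # Field 5 is the file size in bytes; "+ 0" prints 0 when nothing matched.
    hdfs dfs -ls -R "$INPATH" |
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {
           sum += $5 ; cnt += 1} END {
           print cnt + 0, "Files with total", sum + 0, "Bytes"}'
  ;;
  delete)
    # Field 8 is taken as the path, so paths containing spaces are not handled.
    # -r stops xargs from running hdfs dfs -rm when nothing matched.
    hdfs dfs -ls -R "$INPATH" |
      awk -v ago="$AGO" '$1 ~ /^[^d]/ && ($6 " " $7) < ago {print $8}' |
      xargs -r hdfs dfs -rm -skipTrash
  ;;
  *)
    echo "$USAGE"
    exit 1
  ;;
esac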

I hope others find this useful.
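
One caveat: printing only field 8 cuts off any path that contains a space. If that matters, a filter like the sketch below could rebuild the whole path by stripping the first seven columns instead. The name old_files is made up for illustration, the eight-column -ls layout is assumed, and I have not checked how hdfs dfs -rm treats such paths on every Hadoop version.

```shell
#!/bin/bash
# Hypothetical helper: reads `hdfs dfs -ls -R` style lines on stdin and
# prints the full path of every regular file older than the cutoff,
# keeping any spaces inside the path.
old_files() {
  local cutoff="$1"   # expected as "YYYY-MM-DD HH:MM"
  awk -v cutoff="$cutoff" '
    $1 ~ /^-/ && ($6 " " $7) < cutoff {
      line = $0
      # Drop the seven leading columns; whatever remains is the path.
      for (i = 1; i <= 7; i++) sub(/^ *[^ ]+ +/, "", line)
      print line
    }'
}

# Example invocation (assumes GNU date and xargs; /tmp/hive is the
# script default path):
#   hdfs dfs -ls -R /tmp/hive |
#     old_files "$(date --date '10 days ago' '+%F %R')" |
#     tr '\n' '\0' | xargs -0 -r hdfs dfs -rm -skipTrash
```

Unlike the script above, this matches only entries whose permission string starts with a dash, which is a slightly stricter test than "not a directory".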

dlamblin answered Nov 07 '22 21:11