
Hadoop & Bash: delete filenames matching range

Tags: bash, hadoop

Say you have a list of files in HDFS with a common prefix and an incrementing suffix. For example,

part-1.gz, part-2.gz, part-3.gz, ..., part-50.gz

I only want to leave a few files in the directory, say three. Any three will do: the files are only used for testing, so the choice doesn't matter.

What's the simplest and fastest way to delete the other 47 files?

volni asked Oct 11 '11 22:10


3 Answers

A few options here:


Move three files manually over to a new folder, then delete the old folder.


Grab the file names with hadoop fs -ls, take the first n, then rm them. This is the most robust method, in my opinion.

hadoop fs -ls /path/to/files gives you the directory listing.

hadoop fs -ls /path/to/files | grep 'part' | awk '{print $8}' prints only the file names, since the path is the eighth column of the listing (adjust the grep to match the files you want).

hadoop fs -ls /path/to/files | grep 'part' | awk '{print $8}' | head -n47 keeps only the first 47 of them.

Throw this into a for loop and rm them:

for k in $(hadoop fs -ls /path/to/files | grep part | awk '{print $8}' | head -n47)
do
   # quote the path so an unusual name is not re-split or glob-expanded
   hadoop fs -rm "$k"
done

Instead of a for-loop, you could use xargs:

hadoop fs -ls /path/to/files | grep part | awk '{print $8}' | head -n47 | xargs hadoop fs -rm

Thanks to Keith for the inspiration
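
Before deleting anything, the selection can be rehearsed against a canned listing. This is only a minimal sketch, assuming the usual eight-column hadoop fs -ls layout; sample_listing and select_victims are hypothetical helpers made up for illustration, not part of Hadoop, and the owner and group names in the fake listing are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `hadoop fs -ls /path/to/files`: prints a
# "Found N items" header followed by eight-column entries.
sample_listing() {
    echo "Found 50 items"
    for i in $(seq 1 50); do
        echo "-rw-r--r--   3 someuser somegroup       1024 2011-10-11 22:10 /path/to/files/part-$i.gz"
    done
}

# Same filter as in the answer: keep lines mentioning 'part', print
# column 8 (the path), and take the first N lines (default 47).
select_victims() {
    grep 'part' | awk '{print $8}' | head -n "${1:-47}"
}

# Dry run: print the rm commands instead of executing them.
sample_listing | select_victims 47 | xargs -n 1 echo hadoop fs -rm
```

Swapping sample_listing for the real hadoop fs -ls call and dropping the echo would turn the dry run into the actual deletion.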

Donald Miner answered Oct 21 '22 23:10


In Bash?

Which files do you want to keep and why? What are their names? In the above example, you could do something like this:

$ rm !(part-[1-3].gz)

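which will remove all files except part-1.gz, part-2.gz, and part-3.gz. The !(...) pattern needs Bash's extended globbing, so run shopt -s extglob first if it isn't already enabled.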

You can also do something like this:

$ rm $(ls | sed -n '4,$p')

Which will remove all except the first three files listed, since sed -n '4,$p' prints everything from the fourth line onward.

You could also do this:

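$ ls | sed -n '4,$p' | xargs rm

Which is safer if you have hundreds and hundreds of files in the directory, because xargs splits the names across several rm calls instead of building one oversized command line.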
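
To see which three files actually survive, the ls | sed | xargs variant can be tried in a scratch directory first. A minimal sketch on local files only (not HDFS); the temporary directory and the empty stand-in files are assumptions for illustration, and LC_ALL=C is set so the sort order of ls is predictable.

```shell
#!/usr/bin/env bash
# Use byte-wise ordering so `ls` sorts the same way everywhere.
export LC_ALL=C

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Create empty stand-ins for part-1.gz ... part-50.gz.
for i in $(seq 1 50); do : > "part-$i.gz"; done

# ls sorts names as strings, so the first three entries are
# part-1.gz, part-10.gz and part-11.gz, not part-1 through part-3.
ls | sed -n '4,$p' | xargs rm

# Collect what is left on one line, then clean up.
remaining=$(ls | xargs)
echo "kept: $remaining"
cd / && rm -rf "$workdir"
```

Since any three files will do here, the string ordering is harmless, but it matters if specific files must be kept.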

David W. answered Oct 21 '22 21:10


Do you need to keep the first three or the last three?

To remove all but the first three:

hadoop fs -ls | awk '{print $8}' | grep 'part-[0-9]*\.gz' | sort -g -k2 -t- | tail -n +4 | xargs -r -d\\n hadoop fs -rm

To remove all but the last three:

hadoop fs -ls | awk '{print $8}' | grep 'part-[0-9]*\.gz' | sort -g -k2 -t- | head -n -3 | xargs -r -d\\n hadoop fs -rm

Note that these commands don't depend on the actual number of files, on there being more than three, or on the order of the original listing, but they do depend on the number following the first hyphen in each sorted line. The xargs options aren't strictly necessary: -r skips the hadoop call when nothing is left to delete, and -d\\n treats each input line as a single argument (both are GNU xargs options).
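
The effect of the numeric sort and the two cut-offs can be checked without a cluster. A minimal sketch, assuming GNU coreutils (head -n -3 is a GNU extension); names and sorted are hypothetical helpers, and twelve made-up file names stand in for the real listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit part-1.gz ... part-12.gz in string order,
# the way a plain listing might return them.
names() {
    printf 'part-%d.gz\n' 1 10 11 12 2 3 4 5 6 7 8 9
}

# Numeric sort on the text after the first hyphen.
sorted() {
    names | sort -g -k2 -t-
}

# Everything except the first three: these would be deleted.
all_but_first_three=$(sorted | tail -n +4 | xargs)

# Everything except the last three (needs GNU head).
all_but_last_three=$(sorted | head -n -3 | xargs)

echo "delete, keeping first three: $all_but_first_three"
echo "delete, keeping last three:  $all_but_last_three"
```

The first line lists part-4.gz through part-12.gz and the second part-1.gz through part-9.gz, which shows that the selection follows the numeric suffix and not the order of the input.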

eswald answered Oct 21 '22 22:10