
How to list only the file names in HDFS

Tags:

shell

hadoop

I would like to know whether there is any command/expression to get only the file name in Hadoop. I need to fetch only the name of the file; when I do hadoop fs -ls it prints the whole path.

I tried the command below, but I am wondering if there is a better way to do it.

hadoop fs -ls <HDFS_DIR> | cut -d ' ' -f17
asked Feb 05 '14 by Navneet Kumar




2 Answers

The following command will return filenames only:

hdfs dfs -stat "%n" my/path/* 
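For reference, hdfs dfs -stat accepts other format specifiers besides %n (the Hadoop shell documentation also lists, for example, %F for the entry type and %y for the modification time, though the exact set depends on your Hadoop version). A minimal sketch, assuming those specifiers are available, that shows which matching entries are plain files versus directories:

# print type and bare name for each matching entry (specifier set may vary by version)
hdfs dfs -stat "%F %n" my/path/*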

Update (Feb 04 '21):

Actually, for the last few years I have been using


hdfs dfs -ls -d my/path/* | awk '{print $8}'

and

hdfs dfs -ls my/path | grep -e "^-" | awk '{print $8}'
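If you want bare filenames (no directory prefix) for regular files only, a small wrapper along the lines below should work. This is just a sketch that combines the grep "^-" filter above with basename; it assumes the default hdfs dfs -ls layout where the path is the 8th column, and it will misbehave on names containing spaces, as discussed in the other answer:

# hypothetical helper: print bare names of regular files in an HDFS directory
# directory lines start with "d", regular files with "-"; $8 is the path column
hdfs_filenames() {
    hdfs dfs -ls "$1" | grep -e '^-' | awk '{print $8}' | xargs -n 1 basename
}

# example usage (the path is illustrative)
hdfs_filenames /user/hive/warehouse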

answered Sep 27 '22 by MichealKum


It seems hadoop fs -ls does not support any option to output just the filenames, or even just the last column.

If you want to get the last column reliably, first squeeze runs of whitespace into single spaces so that you can then address the last column:

hadoop fs -ls | sed '1d;s/  */ /g' | cut -d\  -f8

This gets you just the last column, but the files still include the whole path. If you want just the filenames, you can use basename as @rojomoke suggests:

hadoop fs -ls | sed '1d;s/  */ /g' | cut -d\  -f8 | xargs -n 1 basename

I also filtered out the first line, which says Found x items (that is what the 1d in the sed command does).

Note: as @felix-frank points out in the comments, the command above will not correctly preserve file names that contain multiple consecutive spaces (the whitespace squeezing mangles them). Hence a more robust solution proposed by Felix:

hadoop fs -ls /tmp | sed 1d | perl -wlne'print +(split " ",$_,8)[7]'
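If you also want to drop the directory prefix while still preserving any embedded spaces, one option (my addition, not part of Felix's comment) is to append a sed step that strips everything up to the last slash:

hadoop fs -ls /tmp | sed 1d | perl -wlne'print +(split " ",$_,8)[7]' | sed 's|.*/||'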

answered Sep 27 '22 by Jakub Kotowski