 

How to count number of files under specific directory in hadoop?

I'm new to the map-reduce framework. I want to find the number of files under a specific directory by providing the name of that directory. For example, suppose we have 3 directories A, B, and C, holding 20, 30, and 40 part-r files respectively. I want to write a Hadoop job that counts the files/records in each directory, i.e. I want the output in a .txt file formatted as below:

A is having 20 records

B is having 30 records

C is having 40 records

All these directories are present in HDFS.

Prasanna asked Aug 05 '16


People also ask

What is Count command in hadoop?

The Hadoop fs shell command count counts the number of directories, files, and bytes under the paths that match the specified file pattern. Options: -q shows quotas (a quota is the hard limit on the number of names and the amount of space used for a directory); -u limits the output to quotas and usage only.

How do I find the size of a directory in hadoop?

You can use the "hadoop fs -ls" command. This command displays the list of files in the current directory along with all their details. In the output of this command, the 5th column shows the size of each file in bytes.

What command is used to list the files in a folder in hadoop?

Hadoop ls Command The ls command in Hadoop shows the list of files/contents in a specified directory, i.e., path. Adding "-R" before /path makes the listing recursive, so the output shows details such as name, size, and owner for every file under the given directory.

How do I list all files in HDFS and size?

You can use the hadoop fs -ls command to list files in the current directory as well as their details. The 5th column in the command output contains the file size in bytes.
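The "fifth column" trick above can be checked without a cluster, since a local long listing puts the size in the same column position; on HDFS you would run hadoop fs -ls over your path instead. This is a local stand-in sketch, and the temp-file layout is made up for the example:

```shell
#!/bin/sh
# Create a throwaway file with a known size (5 bytes).
base=$(mktemp -d)
printf 'hello' > "$base/f.txt"

# Column 5 of a long listing is the size in bytes, the same column
# position as in `hadoop fs -ls` output.
size=$(ls -l "$base/f.txt" | awk '{print $5}')
echo "size=$size"
```

On a real cluster the equivalent would be hadoop fs -ls over the directory, piped through the same awk filter.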


1 Answer

The simplest/native approach is to use the built-in hdfs commands, in this case -count:

hdfs dfs -count /path/to/your/dir  >> output.txt

Its output has four columns: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME, so the second column gives you the file count directly.

Or, if you prefer a mixed approach via Linux commands:

hadoop fs -ls /path/to/your/dir/*  | wc -l >> output.txt

Note that if the glob matches subdirectories, -ls lists their contents and prints a "Found n items" header for each, which can skew the line count; filtering with grep -v '^Found' avoids that.
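To get the exact report format the question asks for (one "X is having N records" line per directory), a small shell loop works. The sketch below uses local directories and find as a stand-in so it runs anywhere; on a real cluster you would replace the find ... | wc -l with hdfs dfs -count "$d" piped through awk to take the file-count column. All paths and file names here are made up for the example:

```shell
#!/bin/sh
# Local stand-in for the HDFS directories A and B from the question.
base=$(mktemp -d)
mkdir -p "$base/A" "$base/B"
touch "$base/A/part-r-00000" "$base/A/part-r-00001" "$base/A/part-r-00002"
touch "$base/B/part-r-00000"

# Produce the requested report format, one line per directory.
for d in "$base"/A "$base"/B; do
  n=$(find "$d" -type f | wc -l | tr -d ' ')
  echo "$(basename "$d") is having $n records"
done > "$base/output.txt"

cat "$base/output.txt"
```

The same loop shape works against HDFS by swapping the counting command; the report lands in output.txt as requested.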

Finally, the MapReduce version has already been answered here:

How do I count the number of files in HDFS from an MR job?

Code:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false; // set to true to descend into subdirectories
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    ri.next();   // advance the iterator; each entry is one file
    count++;
}
System.out.println("The count is: " + count);
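The same counting loop can be exercised without a cluster: java.nio.file offers an analogous directory iterator, so the sketch below is a local-filesystem stand-in (not the Hadoop API) that counts files the way FileSystem.listFiles does with recursive = false. The CountFiles class name and the temp-directory layout are made up for the example:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class CountFiles {
    // Non-recursive file count, mirroring fs.listFiles(path, false):
    // iterate the directory entries and count the regular files.
    public static long countFiles(Path dir) throws IOException {
        long count = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with three part-r files, then count them.
        Path dir = Files.createTempDirectory("A");
        for (int i = 0; i < 3; i++) {
            Files.createFile(dir.resolve(String.format("part-r-%05d", i)));
        }
        System.out.println("A is having " + countFiles(dir) + " records");
    }
}
```

Swapping the java.nio calls for the Hadoop FileSystem calls shown above turns this into the HDFS version.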
Petro answered Sep 22 '22