
Wildcard in Hadoop's FileSystem listing API calls

tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...).


Context

Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure looks like this:

/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc

In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema may start existing, evolve, and stop existing as time passes.


Goal

I need to be able to get all the schemas that exist for a given type, as quickly as possible. For example, to get all the schemas that exist for type A, I would like to do the equivalent of the following:

hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc

That would give me:

Found 1 items
-rw-r--r--   3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r--   3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc

Problem

I don't want to use the shell command, and I cannot seem to find an equivalent of the command above in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...


What I found so far

Notice that the shell prints Found 1 items twice, once before each result, rather than printing Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to see how that is implemented.
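The observation above fits a model in which the client expands the pattern itself: split the glob into path components and, at each level, keep only the children whose names match. The sketch below simulates that idea on an in-memory copy of the example tree. It is not Hadoop's code, and whether the real client works exactly this way is an assumption. The class name GlobExpansionSketch, the TREE map and the toRegex helper are made up for illustration, and only the * and ? wildcards are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class GlobExpansionSketch {

    // Hypothetical in-memory stand-in for the example tree: directory path -> child names.
    static final Map<String, List<String>> TREE = new HashMap<>();
    static {
        TREE.put("/", Arrays.asList("schemas_folder"));
        TREE.put("/schemas_folder",
                Arrays.asList("date=20140101", "date=20140102", "date=20140103"));
        TREE.put("/schemas_folder/date=20140101",
                Arrays.asList("A-schema.avsc", "B-schema.avsc"));
        TREE.put("/schemas_folder/date=20140102",
                Arrays.asList("A-schema.avsc", "B-schema.avsc", "C-schema.avsc"));
        TREE.put("/schemas_folder/date=20140103",
                Arrays.asList("B-schema.avsc", "C-schema.avsc"));
    }

    // Translate one path component containing '*' and '?' into a regex.
    static Pattern toRegex(String component) {
        StringBuilder regex = new StringBuilder();
        for (char c : component.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Expand an absolute glob one component at a time, keeping only matching candidates.
    static List<String> expand(String glob) {
        List<String> candidates = Collections.singletonList(""); // "" stands for the root
        for (String component : glob.substring(1).split("/")) {
            Pattern pattern = toRegex(component);
            List<String> next = new ArrayList<>();
            for (String parent : candidates) {
                String dir = parent.isEmpty() ? "/" : parent;
                // One simulated listing per surviving parent; files have no entry and yield nothing.
                for (String name : TREE.getOrDefault(dir, Collections.emptyList())) {
                    if (pattern.matcher(name).matches()) {
                        next.add(parent + "/" + name);
                    }
                }
            }
            candidates = next;
        }
        return candidates;
    }

    public static void main(String[] args) {
        for (String path : expand("/schemas_folder/date=*/A-schema.avsc")) {
            System.out.println(path);
        }
    }
}
```

For simplicity this sketch lists the parent directory for every component, including the ones without a wildcard. A real client could instead check such components directly, so the sketch says nothing reliable about the number of remote calls involved.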

Below are my first shots, probably a bit too naïve...

Using listFiles(...)

Code:

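// Recursively list every file under /schemas_folder, then filter on the client.
RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    // Keep only the type-A schema files that sit directly in a date partition.
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}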

Result:

This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...

Using listStatus(...)

Code:

// Step 1: list only the date partitions directly under /schemas_folder.
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});
// Step 2: collect the partition paths so they can be listed in a single call.
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) { paths[i] = statuses[i].getPath(); }
// Step 3: list all partitions, keeping only the type-A schema file in each.
statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

Result:

Thanks to the PathFilters and the use of arrays, it seems to perform faster (around 12 seconds). The code is more complex, though, and more difficult to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!


Question

What am I missing here? What is the fastest way to get the results I want?


Updates

2014.07.09 - 13:38

The answer proposed by Mukesh S is apparently the best possible API approach.

In the example I gave above, the code ends up looking like this:

FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

This is the best-looking and best-performing code I could come up with so far, but it still does not perform as well as the shell version.

snooze92 asked Jul 09 '14 07:07





1 Answer

Instead of listStatus, you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

An optional PathFilter can be specified to restrict the matches further.
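To illustrate the "glob first, filter second" order without needing a cluster, here is a local-filesystem analogue that uses only the JDK. It is a sketch, not the Hadoop API: Files.newDirectoryStream(dir, glob) stands in for the glob step and a plain Predicate stands in for the PathFilter. The class name LocalGlobWithFilter, the glob(...) helper and the date=latest directory are invented for the example, and the JDK glob syntax is not guaranteed to match Hadoop's in every detail.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class LocalGlobWithFilter {

    // Finds root/<dirGlob>/<fileName>, then applies an extra filter to each match.
    static List<Path> glob(Path root, String dirGlob, String fileName,
                           Predicate<Path> filter) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, dirGlob)) {
            for (Path dir : dirs) {
                Path candidate = dir.resolve(fileName);
                if (Files.isRegularFile(candidate) && filter.test(candidate)) {
                    matches.add(candidate);
                }
            }
        }
        Collections.sort(matches); // directory streams have no guaranteed order
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway tree resembling the example layout.
        Path root = Files.createTempDirectory("schemas_folder");
        for (String dir : new String[] {"date=20140101", "date=20140102", "date=latest"}) {
            Files.createDirectories(root.resolve(dir));
            Files.createFile(root.resolve(dir).resolve("A-schema.avsc"));
        }
        // Extra filter: keep only partitions named "date=" followed by eight digits.
        Predicate<Path> eightDigitDate =
                p -> p.getParent().getFileName().toString().matches("date=[0-9]{8}");
        for (Path match : glob(root, "date=*", "A-schema.avsc", eightDigitDate)) {
            System.out.println(root.relativize(match));
        }
    }
}
```

The glob date=* also matches the made-up date=latest directory; the extra filter is what narrows the result back to the eight-digit partitions, which is the role the optional PathFilter plays in the second globStatus signature.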

For a more detailed description, see Hadoop: The Definitive Guide.

Hope it helps!

Mukesh S answered Sep 20 '22 10:09
