Wildcard in Hadoop's FileSystem listing API calls

Context

Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure looks like this:

/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc

In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema might appear, evolve, and disappear as time passes.

Goal

I need to be able to get all the schemas that exist for a given type, as quickly as possible. For example, to get all the schemas that exist for type A, I would like to do the equivalent of:

hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc

That would give me:

Found 1 items
-rw-r--r--   3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r--   3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc

Problem

I don't want to use the shell command, and I cannot find an equivalent of the command above in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...

What I found so far

Notice that the shell prints "Found 1 items" twice, once before each result, rather than printing "Found 2 items" once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to see how that is implemented.

Below are my first shots, probably a bit too naïve...

Using listFiles(...)

Code:

RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}

Result:

This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...

Using listStatus(...)

Code:

FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) { paths[i] = statuses[i].getPath(); }
statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

Result:

Thanks to the PathFilters and the use of arrays, it seems to perform faster (around 12 seconds). The code is more complex, though, and more difficult to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!

Question

What am I missing here? What is the fastest way to get the results I want?

Updates

2014.07.09 - 13:38

The answer proposed by Mukesh S is apparently the best possible API approach.

In the example I gave above, the code ends up looking like this:

FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

This is the best-looking and best-performing code I could come up with so far, but it still does not perform as well as the shell version.
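To make the glob call easier to adapt to other schema types, it can be wrapped in a small helper. The following is only a sketch: the class name SchemaGlob, the method findSchemas and its parameters are hypothetical and not part of Hadoop. It also guards against a null result, on the assumption (from the globStatus Javadoc) that null can be returned in some no-match cases.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaGlob {
    // Hypothetical helper: returns the path of every <type>-schema.avsc
    // found in any date=* partition directly under root.
    static List<Path> findSchemas(FileSystem fs, Path root, String type) throws IOException {
        // Build a pattern such as <root>/date=*/A-schema.avsc
        Path pattern = new Path(root, "date=*/" + type + "-schema.avsc");
        FileStatus[] statuses = fs.globStatus(pattern);

        List<Path> result = new ArrayList<>();
        if (statuses != null) { // assumption: null is possible when nothing matches
            for (FileStatus status : statuses) {
                result.add(status.getPath());
            }
        }
        return result;
    }
}
```

With the layout from the example, findSchemas(filesystem, new Path("/schemas_folder"), "A") would be expected to return the two A-schema.avsc paths.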

asked Jul 09 '14 07:07

snooze92

1 Answer

Instead of listStatus, you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

An optional PathFilter can be specified to restrict the matches further.
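As a rough illustration of the two-argument overload, the sketch below globs every schema type and then uses a PathFilter to keep only the partitions on or after a given date. The class name GlobWithFilter, the method schemasSince and its parameters are made up for this example. It assumes partition names always follow the date=yyyyMMdd form, so that a plain string comparison orders them correctly, and that the filter is applied to the fully matched file paths.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class GlobWithFilter {
    // Hypothetical helper: schemas of any type, restricted to partitions
    // whose date (yyyyMMdd) is on or after minDate.
    static FileStatus[] schemasSince(FileSystem fs, Path root, String minDate) throws IOException {
        // The glob selects <root>/date=*/<anything>-schema.avsc
        Path pattern = new Path(root, "date=*/*-schema.avsc");

        // The filter inspects the parent directory name, e.g. "date=20140102".
        PathFilter sinceFilter = path ->
                path.getParent().getName().compareTo("date=" + minDate) >= 0;

        FileStatus[] matches = fs.globStatus(pattern, sinceFilter);
        // Assumption: treat a null result the same as "no matches".
        return matches == null ? new FileStatus[0] : matches;
    }
}
```

The glob does the coarse selection, and the filter only handles what the glob syntax cannot easily express (here, a date range).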

For a more detailed description, see Hadoop: The Definitive Guide.

Hope it helps!

answered Sep 20 '22 10:09

Mukesh S

Tags: java, wildcard, hadoop, hdfs