tl;dr: To use wildcards (globs) in the listed paths, simply use globStatus(...) instead of listStatus(...).
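As a self-contained sketch of that call (the class name and the assumption of a default-configured HDFS client are mine, not part of the original question):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
    public static void main(String[] args) throws Exception {
        // Connect to the default filesystem named in core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());

        // globStatus expands the wildcard for us; no manual listing/filtering.
        FileStatus[] statuses =
                fs.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));

        // globStatus can return null when the non-glob part of the
        // pattern does not exist, so guard before iterating.
        if (statuses != null) {
            for (FileStatus status : statuses) {
                System.out.println(status.getPath());
            }
        }
    }
}
```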
Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure looks like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc
In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema may start existing, evolve, and stop existing as time passes.
I need to get all the schemas that exist for a given type, as quickly as possible. To get all the schemas that exist for type A, for example, I would like to do the following:
hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc
That would give me
Found 1 items
-rw-r--r-- 3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r-- 3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc
I don't want to use the shell command, and I cannot find the equivalent of that command in the Java APIs. When I implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...
Notice that it prints Found 1 items twice, once before each result, rather than Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to look at to see how that was implemented.
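One plausible reconstruction of that client-side behaviour (a sketch only, not the actual FsShell source): expand the glob into concrete paths first, then issue one listing call per expanded path, which would produce one Found ... items header per match. The class name and the per-match listing are my assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellLikeGlob {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Step 1: expand the wildcard into concrete matches, client-side.
        FileStatus[] matches =
                fs.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
        if (matches == null) {
            return; // non-glob parent did not exist
        }

        // Step 2: one listing (and one "Found N items" header) per expanded
        // argument, the way -ls behaves when given several paths.
        for (FileStatus match : matches) {
            FileStatus[] listed = fs.listStatus(match.getPath());
            System.out.println("Found " + listed.length + " items");
            for (FileStatus status : listed) {
                System.out.println(status.getPath());
            }
        }
    }
}
```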
Below are my first shots, probably a bit too naïve...
// Attempt 1: list everything recursively, then filter with a regex.
RemoteIterator<LocatedFileStatus> files =
        filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches()) {
        System.out.println(path);
    }
}
This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...
// Attempt 2: list the date=... directories first, then list the matching
// file in each of them, using PathFilters to avoid the recursive walk.
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"),
        new PathFilter() {
            private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

            @Override
            public boolean accept(Path path) {
                return pattern.matcher(path.getName()).matches();
            }
        });
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) {
    paths[i] = statuses[i].getPath();
}
statuses = filesystem.listStatus(paths, new PathFilter() {
    @Override
    public boolean accept(Path path) {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses) {
    System.out.println(status.getPath());
}
Thanks to the PathFilters and the use of arrays, this version performs faster (around 12 seconds). The code is more complex, though, and harder to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!
What am I missing here? What is the fastest way to get the results I want?
The answer proposed by Mukesh S is apparently the best possible API approach.
In the example I gave above, the code ends up looking like this:
FileStatus[] statuses = filesystem.globStatus(
        new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses) {
    System.out.println(status.getPath());
}
This is the cleanest and fastest code I could come up with so far, but it still does not perform as well as the shell version.
The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further.
Instead of listStatus you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
An optional PathFilter can be specified to restrict the matches further.
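For instance, the filter variant could express the same query by globbing more broadly and then narrowing the matches (a sketch; the broader *-schema.avsc pattern and the class name are my illustration, not from the answer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class GlobWithFilter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The glob selects candidate paths; the PathFilter narrows them.
        FileStatus[] statuses = fs.globStatus(
                new Path("/schemas_folder/date=*/*-schema.avsc"),
                new PathFilter() {
                    @Override
                    public boolean accept(Path path) {
                        // Keep only the A schemas among all matched schemas.
                        return "A-schema.avsc".equals(path.getName());
                    }
                });
        if (statuses != null) {
            for (FileStatus status : statuses) {
                System.out.println(status.getPath());
            }
        }
    }
}
```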
For more details you can check Hadoop: The Definitive Guide here.
Hope it helps!