Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficently find files in specific directories

Tags:

java

nio

I have a simple problem: I iterate a large and deeply nested directory structure using Files.walkFileTree like this:

final int CUTOFF = 5;
final List<Path> foundList = new ArrayList<>();
Files.walkFileTree( codeRoot, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
             throws IOException {
        String rPath = codeRoot.relativize( dir ).toString();
        int level = rPath.length() - rPath.replace("/", "").length();
        if (dir.getFileName().toString().equals( "target" ) || level < CUTOFF) {
            return FileVisitResult.CONTINUE;
        }
        return FileVisitResult.SKIP_SUBTREE;
    }
    @Override
    public FileVisitResult visitFile( Path file, BasicFileAttributes attrs ) 
            throws IOException {
        if (file.getFileName().toString().endsWith( ".txt" )) {
            foundList.add( file );
        }
        return FileVisitResult.CONTINUE;
    }
} );

My goal is to add all files under a specific directory target that I know is at most CUTOFF levels under codeRoot.

I'm looking for a more efficient way to do this in terms of necessary stat() calls or someone saying "can't be done".

Language level is Java8.

like image 314
mabi Avatar asked Sep 30 '15 09:09

mabi


2 Answers

The algorithm presented is a one-time query. In this case, you're stuck with a linear-time search through all the directories. You can't minimize away the need to examine each directory that way. You can look at caching, of course, but if you're going to bother with cache coherency and need high performance you might also consider building an index. In either case I'll address the question you asked, which is about a one-time query.

The version of Files.walkFileTree you're using walks the entire tree, including all the files and directories past the maximum level. You're explicitly excluding them by parsing the path name, a technique you rightly think might not be efficient. The solution is to always read the documentation. There's a second version of Files.walkFileTree with maximum depth as an explicit argument. From a tutorial on walking the file tree:

The second walkFileTree method enables you to additionally specify a limit on the number of levels visited and a set of FileVisitOption enums.

If you use the second method you'll only visit candidate files within the maximum level, and you can avoid all the code that prunes subtrees.

like image 96
eh9 Avatar answered Sep 20 '22 23:09

eh9


Optimization options:

1) register for notification when directory changes: https://docs.oracle.com/javase/tutorial/essential/io/notification.html this can work in the background

2) (less optimal) use caching of not changed directories (in some filesystems): use the last modified time of the directory to cache directories which haven't changed since the last call

Using grepcode, I could not find how relativize is implemented, I think it might be implemented natively. I guess it is implemented with simple string operations of already pulled values and I don't think it is accessing stat() at all. You can test it although, make a dummy code (which doesn't work anything useful) with and without relativize and measure real impact when traversing a lot of files. Than you can be sure you are not losing much performance due to relativize

like image 32
Mladen Adamovic Avatar answered Sep 18 '22 23:09

Mladen Adamovic