Efficently find files in specific directories

Question

I have a simple problem: I iterate a large and deeply nested directory structure using Files.walkFileTree like this:

final int CUTOFF = 5;
final List<Path> foundList = new ArrayList<>();
Files.walkFileTree( codeRoot, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
             throws IOException {
        String rPath = codeRoot.relativize( dir ).toString();
        int level = rPath.length() - rPath.replace("/", "").length();
        if (dir.getFileName().toString().equals( "target" ) || level < CUTOFF) {
            return FileVisitResult.CONTINUE;
        }
        return FileVisitResult.SKIP_SUBTREE;
    }
    @Override
    public FileVisitResult visitFile( Path file, BasicFileAttributes attrs ) 
            throws IOException {
        if (file.getFileName().toString().endsWith( ".txt" )) {
            foundList.add( file );
        }
        return FileVisitResult.CONTINUE;
    }
} );

My goal is to add all files under a specific directory target that I know is at most CUTOFF levels under codeRoot.

I'm looking for a more efficient way to do this in terms of necessary stat() calls or someone saying "can't be done".

Language level is Java8.

eh9 · Accepted Answer

The algorithm presented is a one-time query. In this case, you're stuck with a linear-time search through all the directories. You can't minimize away the need to examine each directory that way. You can look at caching, of course, but if you're going to bother with cache coherency and need high performance you might also consider building an index. In either case I'll address the question you asked, which is about a one-time query.

The version of Files.walkFileTree you're using walks the entire tree, including all the files and directories past the maximum level. You're explicitly excluding them by parsing the path name, a technique you rightly think might not be efficient. The solution is to always read the documentation. There's a second version of Files.walkFileTree with maximum depth as an explicit argument. From a tutorial on walking the file tree:

The second walkFileTree method enables you to additionally specify a limit on the number of levels visited and a set of FileVisitOption enums.

If you use the second method you'll only visit candidate files within the maximum level, and you can avoid all the code that prunes subtrees.

Mladen Adamovic · Answer

Optimization options:

1) register for notification when directory changes: https://docs.oracle.com/javase/tutorial/essential/io/notification.html this can work in the background

2) (less optimal) use caching of not changed directories (in some filesystems): use the last modified time of the directory to cache directories which haven't changed since the last call

Using grepcode, I could not find how relativize is implemented, I think it might be implemented natively. I guess it is implemented with simple string operations of already pulled values and I don't think it is accessing stat() at all. You can test it although, make a dummy code (which doesn't work anything useful) with and without relativize and measure real impact when traversing a lot of files. Than you can be sure you are not losing much performance due to relativize

Efficently find files in specific directories

Tags:

java

nio

mabi

2 Answers

eh9

Mladen Adamovic

Recent Activity

Donate For Us

Efficently find files in specific directories

Tags:

java

nio

mabi

2 Answers

eh9

Mladen Adamovic

Related questions

Recent Activity

Donate For Us