I have a simple problem: I iterate over a large, deeply nested directory structure using Files.walkFileTree like this:
final int CUTOFF = 5;
final List<Path> foundList = new ArrayList<>();
Files.walkFileTree(codeRoot, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
            throws IOException {
        String rPath = codeRoot.relativize(dir).toString();
        int level = rPath.length() - rPath.replace("/", "").length();
        if (dir.getFileName().toString().equals("target") || level < CUTOFF) {
            return FileVisitResult.CONTINUE;
        }
        return FileVisitResult.SKIP_SUBTREE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
            throws IOException {
        if (file.getFileName().toString().endsWith(".txt")) {
            foundList.add(file);
        }
        return FileVisitResult.CONTINUE;
    }
});
My goal is to add all files under a specific directory target that I know is at most CUTOFF levels under codeRoot.
I'm looking for a more efficient way to do this in terms of necessary stat() calls, or for someone to say "can't be done". Language level is Java 8.
The algorithm presented is a one-time query. In this case, you're stuck with a linear-time search through all the directories. You can't minimize away the need to examine each directory that way. You can look at caching, of course, but if you're going to bother with cache coherency and need high performance you might also consider building an index. In either case I'll address the question you asked, which is about a one-time query.
The version of Files.walkFileTree you're using walks the entire tree, including all the files and directories past the maximum level. You're explicitly excluding them by parsing the path name, a technique you rightly suspect is inefficient. The solution, as always, is to read the documentation: there's a second overload of Files.walkFileTree that takes the maximum depth as an explicit argument. From a tutorial on walking the file tree:
The second walkFileTree method enables you to additionally specify a limit on the number of levels visited and a set of FileVisitOption enums.
If you use the second method you'll only visit candidate files within the maximum level, and you can avoid all the code that prunes subtrees.
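A sketch of the question's visitor rewritten against that overload (the class and method names here are mine, and whether maxDepth should be CUTOFF or CUTOFF + 1 depends on how deep under target the .txt files actually sit):

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

public class BoundedWalk {
    // Collect *.txt files no deeper than maxDepth levels under root,
    // letting walkFileTree itself do the pruning instead of
    // counting separators in a relativized path.
    static List<Path> findTxt(Path root, int maxDepth) throws IOException {
        final List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, EnumSet.noneOf(FileVisitOption.class), maxDepth,
                new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (file.getFileName().toString().endsWith(".txt")) {
                    found.add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return found;
    }
}
```

With this overload there is no preVisitDirectory at all: directories past maxDepth are simply never entered, so no stat() calls are spent on their contents.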
Optimization options:
1) Register for notification when the directory changes: https://docs.oracle.com/javase/tutorial/essential/io/notification.html This can run in the background.
2) (less optimal) Cache unchanged directories (on some filesystems): use the directory's last-modified time to skip re-listing directories that haven't changed since the last call.
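Option 1 can be sketched with the standard WatchService API; the helper names below are illustrative, and real code would keep the loop running on a background thread and use the events to invalidate a cached listing:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DirWatcher {
    // Register a directory for change notification; the caller polls the
    // returned WatchService and invalidates its cache when events arrive.
    static WatchService watchDir(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        return watcher;
    }

    // Wait up to the given timeout for one batch of events and return the
    // affected child names (relative to the watched directory).
    static List<Path> pollChanges(WatchService watcher, long seconds)
            throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watcher.poll(seconds, TimeUnit.SECONDS);
        if (key != null) {
            for (WatchEvent<?> event : key.pollEvents()) {
                changed.add((Path) event.context());  // child name, not full path
            }
            key.reset();  // re-arm the key for further events
        }
        return changed;
    }
}
```

Note that a WatchService watches one directory, not a whole tree; to watch the tree you would register each subdirectory (the linked tutorial shows a recursive variant).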
Using grepcode, I could not find how relativize is implemented; it might be native. My guess is that it uses simple string operations on values already in memory and doesn't touch stat() at all. You can test it, though: write a dummy program (one that does no useful work) with and without relativize and measure the real impact when traversing a lot of files. Then you can be sure you aren't losing much performance to relativize.
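One way to convince yourself of this without a full traversal: relativize works on paths that don't even exist on disk, which it could not do if it called stat(). A rough sketch of such a dummy measurement (the paths are placeholders, and this is a sanity check, not a rigorous benchmark):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativizeCost {
    public static void main(String[] args) {
        // Deliberately nonexistent paths: relativize still works,
        // so it cannot be hitting the filesystem.
        Path root = Paths.get("/no/such/project");
        Path deep = Paths.get("/no/such/project/a/b/c/d/e/file.txt");

        long start = System.nanoTime();
        String last = null;
        for (int i = 0; i < 1_000_000; i++) {
            last = root.relativize(deep).toString();  // pure path arithmetic
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(last + " -> " + elapsedMs + " ms for 1M calls");
    }
}
```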