 

Iterating files in scala/java in O(1) open file descriptors

Tags:

java

scala

nio

It appears that nio's .list returns a stream which, when consumed, holds one open file descriptor per file iterated, until .close is called on the entire stream. This means that data directories with upwards of 1,000 files can easily hit common ulimit values, and the file descriptor accumulation is further exacerbated by nested traversals.

What might be an alternative way to iterate over the files of large directories, short of spawning calls to the OS file-listing command? Ideally, iterating over a large directory would keep a file descriptor open only for the file currently being iterated, as proper stream semantics would imply.

Edit:

list returns a Java Stream of java.nio.file.Path. Which API call would close each item on the stream once it has been processed, rather than only when the entire stream is closed, for leaner iteration? In Scala, this can easily be arranged using the API wrapper from better-files, leading from here.
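For reference, a minimal sketch of one candidate alternative (assuming Java 7+ and a local directory): java.nio.file.Files.newDirectoryStream holds a single descriptor for the directory itself rather than one per entry, and try-with-resources releases it when iteration ends.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDir {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // The DirectoryStream keeps one handle open for the directory;
        // the Path objects it yields are plain names, not open files.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                System.out.println(entry.getFileName());
            }
        }
        // The single directory handle is released here.
    }
}
```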

asked Jan 17 '16 by matanster

2 Answers

If that happens, why not use old-school java.io.File?

File folder = new File(pathToFolder);
String[] files = folder.list();

Tested with lsof, and it looks like none of the listed files is held open. You can convert the array to a list or stream afterwards. If the directory is too large or remote, I would suspect the Path objects and try to garbage-collect or otherwise dispose of them.
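The conversion mentioned above can be sketched as follows (a minimal example; the directory path is a placeholder):

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class ConvertListing {
    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        String[] files = folder.list();
        // list() returns null if the path is not a directory
        // or if an I/O error occurs, so check before converting.
        if (files != null) {
            List<String> asList = Arrays.asList(files);
            Stream<String> asStream = Arrays.stream(files);
            asStream.forEach(System.out::println);
        }
    }
}
```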

answered by tomasb


I ran into the same issue (on Windows Server 2012 R2) when I didn't close the stream. All the files I iterated over were open in read mode until the JVM was shut down. However, it did not occur on Mac OS X and since the stream depends on OS-dependent implementations of FileSystemProvider and DirectoryStream, I assume the issue can be OS-dependent, too.

Contrary to @Ian McLaird's comment, the Files.list() documentation states:

If timely disposal of file system resources is required, the try-with-resources construct should be used to ensure that the stream's close method is invoked after the stream operations are completed.

The returned stream is a DirectoryStream, whose Javadoc says:

A DirectoryStream is opened upon creation and is closed by invoking the close method. Closing a directory stream releases any resources associated with the stream. Failure to close the stream may result in a resource leak.

My solution was to follow the advice and use the try-with-resources construct

try (Stream<Path> fileListing = Files.list(directoryPath)) {
    // use the fileListing stream
}

When I closed the stream properly (used the above try-with-resources construct), the file handles were immediately released.

If you don't care about getting the files as a stream, or you are OK with loading the whole file list into memory and converting it to a stream yourself, you can use the IO API:

File directory = new File("/path/to/dir");
File[] files = directory.listFiles();
if (files != null) { // 'files' can be null if 'directory' "does not denote a directory, or if an I/O error occurs."
    // use the 'files' array or convert to a stream:
    Stream<File> fileStream = Arrays.stream(files);
}

I did not experience any file-locking issues with this one. However, note that both solutions rely on native, OS-dependent code, so I advise testing in all environments you would be using.
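For the nested traversals mentioned in the question, the same try-with-resources pattern applies to Files.walk (a sketch, assuming Java 8+; the root path is a placeholder). As documented, the walk keeps at most one open directory handle per level of the tree, not one per file, and closing the stream releases them all:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class WalkDir {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Closing the stream (via try-with-resources) releases the
        // directory handles held open during the recursive walk.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .forEach(System.out::println);
        }
    }
}
```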

answered by Adam Michalik