
Efficient way to iterate over list of files

Tags:

java

I am searching for an efficient way to iterate over thousands of files in one or more directories.

The only way to iterate over files in a directory seems to be the File.list*() methods. These effectively load the entire list of files into an array and then let the user iterate over it, which seems impractical in terms of time and memory consumption. I tried looking at commons-io and other similar tools, but they all ultimately call File.list*() somewhere inside. JDK 7's walkFileTree() came close, but I don't have control over when to pick the next element.

I have over 150,000 files in a directory, and after many -Xms/-Xmx trial runs I got rid of the memory overflow issues. But the time it takes to fill the array hasn't changed.

I wish to make some sort of an Iterable class that uses opendir()/closedir() like functions to lazily load file names as required. Is there a way to do this?

Update:

Java 7 NIO.2 supports file iteration via java.nio.file.DirectoryStream, an interface that extends Iterable. For JDK 6 and below, the only option is the File.list*() methods.
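
For reference, a minimal sketch of the DirectoryStream approach mentioned in the update. The class and method names (LazyDirectoryListing, countRegularFiles) are made up for illustration; Files.newDirectoryStream() is the real NIO.2 call. The stream hands out entries through its iterator rather than returning an array, so the caller decides when to pull the next element.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LazyDirectoryListing {

    // Counts regular files in a directory, pulling entries one at a time
    // from the stream instead of building the whole listing in an array.
    public static long countRegularFiles(Path dir) throws IOException {
        long count = 0;
        // try-with-resources closes the underlying directory handle.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countRegularFiles(dir) + " regular files in " + dir);
    }
}
```

Note that a DirectoryStream can be iterated only once and does not descend into subdirectories; for a recursive walk, walkFileTree() is still needed.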

Unmanned Player asked Mar 28 '12 04:03


2 Answers

Here is an example of how to iterate over directory entries without having to store 150k of them in an array. Add error/exception/shutdown/timeout handling as necessary. This technique uses a secondary thread to fill a small blocking queue, which the iterator drains on demand.

Usage is:

FileWalker z = new FileWalker(new File("\\"), 1024); // start path, queue size
Iterator<Path> i = z.iterator();
while (i.hasNext()) {
  Path p = i.next();
}

The example:

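import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class FileWalker implements Iterator<Path> {
  // Bounded queue between the walking thread (producer) and the iterator (consumer).
  final BlockingQueue<Path> bq;

  FileWalker(final File fileStart, final int size) throws Exception {
    bq = new ArrayBlockingQueue<Path>(size);
    Thread thread = new Thread(new Runnable() {
      public void run() {
        try {
          Files.walkFileTree(fileStart.toPath(), new FileVisitor<Path>() {
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
              return FileVisitResult.CONTINUE;
            }
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
              try {
                // Blocks while the queue is full, so the walk never runs far ahead of the consumer.
                bq.offer(file, 4242, TimeUnit.HOURS);
              } catch (InterruptedException e) {
                // Restore the interrupt flag and stop walking.
                Thread.currentThread().interrupt();
                return FileVisitResult.TERMINATE;
              }
              return FileVisitResult.CONTINUE;
            }
            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
              return FileVisitResult.CONTINUE; // skip unreadable entries
            }
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
              return FileVisitResult.CONTINUE;
            }
          });
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    });
    thread.setDaemon(true); // do not keep the JVM alive for the walk
    thread.start();
    thread.join(200);       // give the producer a short head start
  }

  public Iterator<Path> iterator() {
    return this;
  }

  // Heuristic end-of-walk test: reports false when no entry shows up within
  // 2 seconds, so a very slow file system can end the iteration early.
  public boolean hasNext() {
    boolean hasNext = false;
    long dropDeadMS = System.currentTimeMillis() + 2000;
    while (System.currentTimeMillis() < dropDeadMS) {
      if (bq.peek() != null) {
        hasNext = true;
        break;
      }
      try {
        Thread.sleep(1);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return hasNext;
  }

  // Blocks until an entry is available; returns null only if interrupted.
  public Path next() {
    Path path = null;
    try {
      path = bq.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return path;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}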
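
One possible refinement, sketched here as a variant and not part of the code above: instead of guessing the end of the walk with a timeout in hasNext(), the producer thread can put a sentinel object on the queue when it finishes. The names SentinelFileWalker, END and enqueue are hypothetical; the blocking-queue technique is the same.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SentinelFileWalker implements Iterator<Path> {
    // Hypothetical end-of-walk marker: compared by identity, never handed to callers.
    private static final Path END = Paths.get("");

    private final BlockingQueue<Path> queue;
    private Path next; // look-ahead element; null means "not fetched yet"

    public SentinelFileWalker(final Path start, int queueSize) {
        queue = new ArrayBlockingQueue<Path>(queueSize);
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    Files.walkFileTree(start, new SimpleFileVisitor<Path>() {
                        @Override
                        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                            // put() blocks while the queue is full, throttling the walk.
                            return enqueue(file) ? FileVisitResult.CONTINUE
                                                 : FileVisitResult.TERMINATE;
                        }
                        @Override
                        public FileVisitResult visitFileFailed(Path file, IOException exc) {
                            return FileVisitResult.CONTINUE; // skip unreadable entries
                        }
                    });
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    enqueue(END); // signal completion to the consumer
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    // Returns false if the producer thread was interrupted while waiting.
    private boolean enqueue(Path p) {
        try {
            queue.put(p);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // waits for the next file or for END
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END; // treat interruption as end of iteration
            }
        }
        return next != END;
    }

    public Path next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Path result = next;
        next = null;
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        Path start = Paths.get(args.length > 0 ? args[0] : ".");
        Iterator<Path> files = new SentinelFileWalker(start, 1024);
        while (files.hasNext()) {
            System.out.println(files.next());
        }
    }
}
```

With this shape hasNext() blocks until the producer delivers either a file or the sentinel, so a slow disk delays the iteration but does not cut it short. The sketch still leaves shutdown handling open: if the consumer stops early, the daemon thread stays parked on the full queue until the JVM exits.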
Java42 answered Oct 15 '22 15:10


"This seems to be impractical in terms of time/memory consumption."

Even 150,000 files won't consume an impractical amount of memory.

"I wish to make some sort of an Iterable class that uses opendir()/closedir() like functions to lazily load file names as required. Is there a way to do this?"

You would need to write or find a native code library in order to access those C functions. It is probably going to introduce more problems than it solves. My advice would be to just use File.list() and increase the heap size.
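
A minimal sketch of that advice, with hypothetical names (EagerListing, listNames). File.list() returns plain name strings, whereas listFiles() wraps each name in a File object, and it returns null instead of throwing when the path is not a readable directory, so the result needs a check.

```java
import java.io.File;
import java.io.IOException;

public class EagerListing {

    // Returns all entry names of a directory, loaded in one call.
    public static String[] listNames(File dir) throws IOException {
        String[] names = dir.list();
        if (names == null) {
            // list() signals "not a directory" or an I/O error with null.
            throw new IOException("Not a directory or not readable: " + dir);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : listNames(dir)) {
            System.out.println(name);
        }
    }
}
```

The heap size is raised on the command line, for example java -Xmx512m EagerListing /some/dir, where 512m and the path are only illustrative values.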


Actually, there's another hacky alternative. Use Runtime.exec() (or ProcessBuilder) to run the ls command (or the Windows equivalent) and write your iterator to read and parse the command's output text. That avoids the nastiness associated with using native libraries from Java.
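
A rough sketch of what such an iterator could look like, using ProcessBuilder to launch the external command. It assumes a Unix-like system with ls on the PATH and file names that contain no newline characters; the class name LsIterator is hypothetical.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LsIterator implements Iterator<String>, Closeable {
    private final Process process;
    private final BufferedReader reader;
    private String next;  // look-ahead line; null means "not read yet"
    private boolean done; // true once the output is exhausted or closed

    public LsIterator(String directory) throws IOException {
        // -1: one name per line; -A: include hidden entries except "." and ".."
        ProcessBuilder pb = new ProcessBuilder("ls", "-1", "-A", directory);
        // Send ls error messages to our stderr so they are not parsed as names.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        process = pb.start();
        // Decodes the output with the platform default charset.
        reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
    }

    public boolean hasNext() {
        if (next == null && !done) {
            try {
                next = reader.readLine(); // reads one name from the pipe
            } catch (IOException e) {
                throw new IllegalStateException("Failed to read ls output", e);
            }
            if (next == null) {
                close(); // end of output: release the process
            }
        }
        return next != null;
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = next;
        next = null;
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public void close() {
        done = true;
        try {
            reader.close();
        } catch (IOException ignored) {
            // nothing useful to do while shutting down
        }
        process.destroy();
    }

    public static void main(String[] args) throws IOException {
        LsIterator names = new LsIterator(args.length > 0 ? args[0] : ".");
        while (names.hasNext()) {
            System.out.println(names.next());
        }
    }
}
```

Only one line at a time is held on the Java side, but ls sorts its output by default and therefore reads the whole directory before printing anything, so this mainly moves the memory cost out of the JVM.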

Stephen C answered Oct 15 '22 14:10