How to list a directory with 2 million files in Java without an "out of memory" exception

I have to deal with a directory of about 2 million XML files that need to be processed.

I've already solved the processing part by distributing the work between machines and threads using queues, and everything works fine.

But now the big problem is the bottleneck of reading the directory containing the 2 million files in order to fill the queues incrementally.

I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?

asked Jun 29 '10 by Fgblanch


4 Answers

First of all, is there any chance you can use Java 7? It gives you FileVisitor and Files.walkFileTree, which should work within your memory constraints.

Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way, perhaps putting them in a producer/consumer queue or writing the file names to disk for later traversal.
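
A minimal sketch of that idea (the path, the queue capacity, and the consumer threads that drain the queue are assumptions, not part of the answer):

import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FilterProducer {
    public static void main(String[] args) {
        // Bounded queue: the producer blocks instead of exhausting memory.
        // Consumer threads (not shown) are assumed to drain it.
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
        File dir = new File("/path/to/xml/dir"); // placeholder path

        dir.listFiles(new FileFilter() {
            public boolean accept(File f) {
                try {
                    queue.put(f.getPath()); // side effect: hand off for processing
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return false; // always reject, so the result array stays empty
            }
        });
    }
}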

Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.

If the files are not named in a nice way like this, you could try filtering them based on the hash code of the file name, which should be fairly evenly distributed over the set of integers.
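
For instance, a sketch of such a hash-based filter (the bucket count is arbitrary, and listBucket and NUM_BUCKETS are names I made up):

import java.io.File;
import java.io.FilenameFilter;

class HashBuckets {
    static final int NUM_BUCKETS = 100;

    // Each pass materializes only roughly 1/NUM_BUCKETS of the entries.
    static File[] listBucket(File dir, final int bucket) {
        return dir.listFiles(new FilenameFilter() {
            public boolean accept(File d, String name) {
                // mask the sign bit so the modulo is never negative
                return (name.hashCode() & 0x7fffffff) % NUM_BUCKETS == bucket;
            }
        });
    }
}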


Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:

public File[] listFiles(FilenameFilter filter) {
    String ss[] = list(); // <-- already reads every name in the directory into one array
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}

so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
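
If you go that route, something along these lines could do the reorganization (a sketch only; the bucket count is arbitrary, and you would have to drive it from whatever enumeration method actually works):

import java.io.File;

class Partitioner {
    static final int NUM_DIRS = 1000;

    // Moves a file into one of NUM_DIRS hash-named subdirectories,
    // so no single directory keeps more than a few thousand entries.
    static void moveToBucket(File file) {
        int bucket = (file.getName().hashCode() & 0x7fffffff) % NUM_DIRS;
        File target = new File(file.getParentFile(), String.format("%03d", bucket));
        target.mkdirs();
        if (!file.renameTo(new File(target, file.getName()))) {
            System.err.println("could not move " + file);
        }
    }
}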

Btw, could you give an example of a file name? Are they "guessable"? Like

for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i));

answered by aioobe


If Java 7 is not an option, this hack will work (for UNIX):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// "ls -f" emits entries in directory order, without sorting them first
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    if (line.startsWith("."))
        continue; // skips ".", ".." and hidden files
    System.out.println(line);
}
reader.close();

The -f parameter will speed it up (from man ls):

-f     do not sort, enable -aU, disable -lst

answered by Jörn Horstmann


If you can use Java 7, this can be done with Files.walkFileTree, and you won't have those out-of-memory problems.

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");
Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // here you have the files to process
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});
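
If you don't need to recurse into subdirectories, Files.newDirectoryStream (also Java 7) iterates the entries lazily instead of building an array; a minimal sketch with a placeholder path:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path dir = Paths.get("C:\\path\\with\\lots\\of\\files");
// entries are fetched from the OS incrementally as the loop advances
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
    for (Path file : stream) {
        System.out.println(file); // process each file here
    }
}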

answered by Jaime Hablutzel


Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.

Then, construct File objects as needed when processing the result.
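
A minimal sketch of that approach, with a placeholder path and a println standing in for the real processing:

import java.io.File;

File dir = new File("/path/to/dir"); // placeholder path
String[] names = dir.list();         // plain names: no File objects, no path prefix
if (names != null) {
    for (String name : names) {
        File f = new File(dir, name); // constructed one at a time, as needed
        System.out.println(f);        // stand-in for the real processing
    }
}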

However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.

answered by Michael Borgwardt