I have to deal with a directory of about 2 million XML files to be processed.
I've already solved the processing itself by distributing the work between machines and threads using queues, and that part works fine.
But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.
I've tried using the File.listFiles() method, but it gives me a java.lang.OutOfMemoryError: Java heap space. Any ideas?
First of all, is there any chance you can use Java 7? There you have FileVisitor and Files.walkFileTree, which should work within your memory constraints.
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that captures the files to be processed along the way, perhaps putting them in a producer/consumer queue or writing the file names to disk for later traversal.
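To make that concrete, here is a minimal sketch of such a filter, assuming a BlockingQueue shared with your worker threads (the queue capacity and the directory path are placeholders, not from the question). Keep the update below in mind, though: list() still materialises every name as a String internally, so this only avoids holding the File objects and the result array.

import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EnqueueingLister {
    public static void main(String[] args) {
        // Bounded queue shared with the worker threads; the capacity is arbitrary.
        final BlockingQueue<File> queue = new LinkedBlockingQueue<File>(10000);

        new File("/path/with/lots/of/files").listFiles(new FileFilter() {
            @Override
            public boolean accept(File file) {
                try {
                    queue.put(file); // blocks while the queue is full; workers drain it concurrently
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return false; // never accept, so the returned array stays empty
            }
        });
    }
}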
Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.
If the names do not follow a nice pattern like this, you could try filtering them based on the hash code of the file name, which should be fairly evenly distributed over the set of integers.
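A rough sketch of the hash-based chunking, where each pass over the directory accepts only one bucket of names (the bucket count and directory path are placeholders, and each pass still pays the cost of list() internally, as the update below shows):

import java.io.File;
import java.io.FilenameFilter;

public class HashBucketLister {
    // Accepts only the names whose hash code falls into the given bucket, so each
    // pass over the directory returns roughly 1/numBuckets of the files.
    static File[] listBucket(File dir, final int bucket, final int numBuckets) {
        return dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File d, String name) {
                return Math.abs(name.hashCode() % numBuckets) == bucket;
            }
        });
    }

    public static void main(String[] args) {
        File dir = new File("/path/with/lots/of/files");
        int numBuckets = 100; // illustrative: roughly 20,000 files per bucket for 2 million files
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            File[] chunk = listBucket(dir, bucket, numBuckets);
            if (chunk == null) {
                break; // not a directory, or an I/O error occurred
            }
            for (File f : chunk) {
                // hand f to the processing queue here
            }
        }
    }
}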
Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:
public File[] listFiles(FilenameFilter filter) {
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}
so it will probably fail at the first line (the call to list()) anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
Btw, could you give an example of a file name? Are they "guessable"? Like
for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i))
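If they are guessable, a sketch along those lines (the name pattern, extension and upper bound are hypothetical, as is the idea of probing with File.exists()) would avoid listing the directory at all:

import java.io.File;

public class GuessFileNames {
    public static void main(String[] args) {
        File dir = new File("/path/with/lots/of/files");
        // Probe candidate names directly instead of listing the directory.
        // The pattern, extension and count are guesses; adapt them to the real naming scheme.
        for (int i = 0; i < 2000000; i++) {
            File candidate = new File(dir, String.format("file%07d.xml", i));
            if (candidate.exists()) {
                // enqueue the candidate for processing
            }
        }
    }
}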
If Java 7 is not an option, this hack will work (for UNIX):
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Stream the directory listing from "ls -f" instead of building it in memory.
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    if (line.startsWith("."))
        continue;
    System.out.println(line); // or hand the name to your processing queue
}
The -f parameter will speed it up (from man ls):
-f do not sort, enable -aU, disable -lst
If you can use Java 7, this can be done with Files.walkFileTree, and you won't have those out-of-memory problems.
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");

Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // here you have the files to process
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});
Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.
Then, construct File objects as needed when processing the result.
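A short sketch of that approach (the directory path and the per-file handling are placeholders):

import java.io.File;

public class ListNamesOnly {
    public static void main(String[] args) {
        File dir = new File("/path/with/lots/of/files");
        // list() returns bare names without the parent path, which is cheaper than File objects.
        String[] names = dir.list();
        if (names == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (String name : names) {
            File file = new File(dir, name); // build the File lazily, only when it is needed
            // process or enqueue file here
        }
    }
}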
However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.