Fast way to write millions of small text files in Java?

Tags: java, file-io

I have to dump 6 million files, each containing around 100-200 characters, and it's painfully slow. The actual slow part is the file writing: if I comment that part out (the call to the WriteSoveraFile method), the whole thing runs in 5-10 minutes. As it is, I ran it overnight (16 hours) and only got through 2 million records.

  1. Is there any faster method?

  2. Would I be better off building an array of arrays and then dumping it all at once? (My system only has 4 GB of RAM; wouldn't it die from the 6 GB of data this would consume?)

Here is the procedure:

public static void WriteSoveraFile(String fileName, String path, String contents) throws IOException {

    try {
        String outputFolderPath = cloGetAsFile( GenCCD.o_OutER7Folder ).getAbsolutePath();
        File folder = new File( String.format("%1$s/Sovera/%2$s/", outputFolderPath, path) );

        // create the output directory if it doesn't exist yet
        if (!folder.exists()) {
            folder.mkdirs();
        }

        File file = new File(folder, fileName);

        // if the file doesn't exist, create and write it
        // (if it does exist: delete it? or just overwrite it?)
        if (!file.exists()) {
            // FileWriter creates the file itself, so no separate createNewFile() call is needed;
            // try-with-resources guarantees the writer is closed even if write() throws
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
                bw.write(contents);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
asked Dec 06 '13 by Rob


2 Answers

This is almost certainly an OS/filesystem issue; writing lots of small files is simply slow. I recommend writing a comparison test in shell and in C to get an idea of how much the OS is contributing. Additionally, I would suggest two major tweaks:

  • Ensure the system this is running on is using an SSD. Latency from seeking for filesystem journaling will be a major source of overhead.
  • Multithread your writing process (a sketch follows below). Serialized, the OS can't perform optimizations such as batching writes, and each FileWriter may block on its close() operation.

(I was going to suggest looking into NIO, but the APIs don't seem to offer much benefit for your situation: setting up a memory-mapped buffer would probably introduce more overhead than it would save for files this small.)
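
A minimal sketch of the multithreaded approach, assuming each record's file name and contents can be produced independently; the pool size, directory layout, and record source here are illustrative assumptions, not taken from the question:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelWriter {
    public static void main(String[] args) throws InterruptedException {
        // a handful of threads is usually enough to keep the disk's queue full
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < 6_000_000; i++) {
            final int id = i;  // effectively-final copy for capture by the lambda
            pool.submit(() -> {
                try {
                    // shard into subdirectories so no single directory holds millions of files
                    // (an assumption; the original code groups by its own "path" argument)
                    Path dir = Paths.get("out", "Sovera", String.valueOf(id % 1000));
                    Files.createDirectories(dir);  // no-op if it already exists
                    // hypothetical record contents; substitute the real data here
                    String contents = "record " + id;
                    Files.write(dir.resolve(id + ".txt"),
                                contents.getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}

Threads beyond what the device can service concurrently just queue up, so starting with 4-8 threads and measuring throughput is a reasonable way to tune the pool size.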

answered Sep 18 '22 by chrylis -cautiouslyoptimistic-


As has been mentioned, your limiting factor is storage access, not your code or the JVM. There are a few things in your code that could be improved, but the changes would go unnoticed because the underlying bottleneck is the file I/O.

There are some possible ways to speed up the process:

  • Write to a faster drive (a higher-RPM hard drive or an SSD; NOT a USB drive, because USB communication is much slower than SATA).
  • Use multiple threads to write to a RAID array. Some RAID levels (I can't remember which ones) support concurrent writes.
  • Re-think the file structure so that it isn't necessary to have 6 million files. Since the files all live in a single location, I'm not sure why you need so many small files. The functionality could likely be accomplished by creating one or two larger files that hold all of the data; you would just need to change the format and the reading component (a sketch follows this list). One file would be about 200 chars × 2 bytes/char × 6 million files ≈ 2.4 GB.
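
A minimal sketch of the consolidated-file idea, assuming UTF-8 contents and that a simple length-prefixed record format is acceptable; the class name, file name, and format are illustrative, not from the answer:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RecordFileWriter implements AutoCloseable {
    private final DataOutputStream out;

    public RecordFileWriter(String fileName) throws IOException {
        // one buffered sequential stream: a single open/close instead of 6 million
        out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(fileName)));
    }

    // writes one record as a 2-byte length prefix followed by the UTF-8 bytes
    public void writeRecord(String contents) throws IOException {
        byte[] bytes = contents.getBytes(StandardCharsets.UTF_8);
        out.writeShort(bytes.length);  // records are 100-200 chars, well within a short
        out.write(bytes);
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}

The reader reverses the format: read a two-byte length, then that many bytes, until end of file. The win is that 6 million records become one long sequential write, which is the access pattern disks and filesystems handle best.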
answered Sep 20 '22 by MadConan