
Java 7's nio.file package is uber slow at creating new files

I'm trying to create 300M files from a Java program. I switched from the old file API to the new Java 7 nio package, but the new package is going even slower than the old one.

I see lower CPU utilization than I did with the old file API, but I'm running this simple code and getting only about 0.5 MB/sec of throughput. The Java process reads from one disk and writes to another, and it is the only process accessing the destination disk.

Files.write(FileSystems.getDefault().getPath(filePath), fiveToTenKBytes, StandardOpenOption.CREATE);

Is there any hope of getting a reasonable throughput here?


Update:

I'm unpacking 300 million 5-10k byte image files from large files. I have 3 disks, 1 local and 2 SAN attached (all have a typical throughput rate of ~20MB/sec on large files).

I've also tried the following code, which improved throughput to just under 2 MB/sec (roughly 9 days to unpack all the files).

ByteBuffer byteBuffer = ByteBuffer.wrap(imageBinary, 0, ((BytesWritable) value).getLength());
FileOutputStream fos = new FileOutputStream( imageFile );
fos.getChannel().write(byteBuffer);
fos.close();

I read from the local disk and write to a SAN-attached disk. I'm reading from a Hadoop SequenceFile; Hadoop can typically read these files at 20 MB/sec using basically the same code.
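
The read loop looks roughly like this, using Hadoop's SequenceFile.Reader (a sketch only: the Text key type, the path, and the writeImage helper are illustrative assumptions, not the exact code):

// Sketch of the read side: Text keys (file names) and BytesWritable values
// (image bytes) are assumptions about the sequence file layout.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("/mnt/local/images.seq"), conf);
Text key = new Text();
BytesWritable value = new BytesWritable();
while (reader.next(key, value)) {
    // getBytes() returns a reused buffer; only the first getLength() bytes are valid
    writeImage(key.toString(), value.getBytes(), value.getLength());  // hypothetical helper
}
reader.close();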

The only thing that looks out of place, other than the extreme slowness, is that I see roughly 2:1 more read IO than write IO, even though the sequence file is gzipped (the images compress at virtually 1:1), so the compressed input should be approximately 1:1 with the output.


2nd UPDATE

Looking at iostat I see some odd numbers. We're looking at xvdf here; I have one Java process reading from xvdb and writing to xvdf, and no other processes are active on xvdf.

iostat -d 30
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvdap1            1.37         5.60         4.13        168        124
xvdb             14.80       620.00         0.00      18600          0
xvdap3            0.00         0.00         0.00          0          0
xvdf            668.50      2638.40       282.27      79152       8468
xvdg           1052.70      3751.87      2315.47     112556      69464

The reads on xvdf are 10x the writes, which is hard to believe.

fstab
/dev/xvdf       /mnt/ebs1       auto    defaults,noatime,nodiratime     0       0
/dev/xvdg       /mnt/ebs2       auto    defaults,noatime,nodiratime     0       0
asked Mar 15 '13 by David Parks


2 Answers

If I understood your code correctly, you're splitting/writing the 300M files in small chunks ("fiveToTenKBytes").

Consider using a stream-based approach.

If you're writing to a disk, consider wrapping the OutputStream in a BufferedOutputStream.

E.g. something like:

try (BufferedOutputStream bos = new BufferedOutputStream(
        Files.newOutputStream(Paths.get(filePathString), StandardOpenOption.CREATE))) {

    ...

}
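
Filled in for one image, that might look something like this (a minimal sketch; imageBinary and length are placeholders for one image's bytes and its valid byte count):

// Minimal sketch: write one 5-10 KB image through a buffered stream.
Path target = Paths.get(filePathString);
try (BufferedOutputStream bos = new BufferedOutputStream(
        Files.newOutputStream(target, StandardOpenOption.CREATE))) {
    bos.write(imageBinary, 0, length);
}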
answered Nov 11 '22 by Puce


I think your slowness is coming from creating new files, not actual transfer. I believe that creating a file is a synchronous operation in Linux: the system call will not return until the file has been created and the directory updated. This suggests a couple of things you can do:

  • Use multiple writer threads with a single reader thread. The reader thread reads data from the source file into a byte[], then creates a Runnable that writes the output file from that array (see the sketch after this list). Use a thread pool with lots of threads -- maybe 100 or more -- because they'll spend most of their time waiting for the creat(2) call to complete. Set the capacity of the pool's inbound queue based on the amount of memory you have: if your files are 10k in size, a queue capacity of 1,000 seems reasonable (there's no good reason to let the reader get too far ahead of the writers, so you could even go with a capacity of twice the number of threads).
  • Rather than NIO, use basic BufferedInputStreams and BufferedOutputStreams. Your problem here is syscalls, not memory speed (the NIO classes are designed to prevent copies between heap and off-heap memory).
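
A rough sketch of that reader/writer split (the thread count, queue capacity, nextImage() source, Image holder, and outputDir are illustrative assumptions to tune, not a finished implementation):

// One reader thread feeds a pool of ~100 writer threads through a bounded queue.
ThreadPoolExecutor writers = new ThreadPoolExecutor(
        100, 100, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(1000),        // bounded queue caps how far the reader gets ahead
        new ThreadPoolExecutor.CallerRunsPolicy());    // back-pressure: the reader writes one file itself when the queue is full

Image img;                                             // hypothetical holder for a file name and its bytes
while ((img = nextImage()) != null) {                  // the single reader thread pulls images from the sequence file
    final byte[] data = img.bytes;                     // assume nextImage() returns a fresh array per image
    final File outFile = new File(outputDir, img.name);
    writers.execute(new Runnable() {
        @Override
        public void run() {
            try (OutputStream os = new BufferedOutputStream(new FileOutputStream(outFile))) {
                os.write(data);                        // each task spends most of its time waiting on file creation
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    });
}
writers.shutdown();

CallerRunsPolicy is one simple way to get back-pressure without tracking the queue yourself; a capacity closer to twice the thread count would work just as well.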

I'm going to assume that you already know not to store all the files in a single directory, or even more than a few hundred files in one directory.
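
One common way to keep directories small is to fan files out by a hash of the name; the two-level layout and outputRoot below are just an illustration:

// Illustration: derive two subdirectory levels from a hash of the file name, so 300M
// files spread over 65,536 directories; add a third level to get well under a few
// hundred files per directory.
String name = "img_000123.jpg";                        // hypothetical file name
String prefix = String.format("%08x", name.hashCode());
File dir = new File(outputRoot, prefix.substring(0, 2) + File.separator + prefix.substring(2, 4));
dir.mkdirs();
File outFile = new File(dir, name);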

And as another alternative, have you considered S3 for storage? I'm guessing that its bucket keys are far more efficient than actual directories, and there is a filesystem that lets you access buckets as if they were files (haven't tried it myself).

answered Nov 11 '22 by parsifal