 

What about buffering FileInputStream?

Tags:

java

file-io

I have a piece of code that reads a very large number (hundreds of thousands) of relatively small files (a couple of KB each) from the local file system in a loop. For each file, a java.io.FileInputStream is created to read the content. The process is very slow and takes ages.

Do you think that wrapping the FileInputStream in a java.io.BufferedInputStream would make a significant difference?

Tomasz Błachowicz asked May 21 '10 12:05



3 Answers

If you aren't already using a byte[] buffer of a decent size in the read/write loop (the latest implementation of BufferedInputStream uses 8 KB), then it will certainly make a difference. Give it a try yourself. Don't forget to make any OutputStream a BufferedOutputStream as well.

But if you have already buffered it using a byte[], or if buffering turns out to make only a little difference, then you've hit the hard disk and I/O controller speed as the bottleneck.
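As a rough illustration of the byte[] approach described above, here is a minimal sketch that reads a whole file through a plain FileInputStream in 8 KB chunks. The class name `ChunkedRead`, the method `readAll`, and the temp file in `main` are made up for this example and are not from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ChunkedRead {
    // 8 KB, the same size as the default BufferedInputStream buffer.
    private static final int BUFFER_SIZE = 8 * 1024;

    /** Reads a whole file with an unwrapped FileInputStream and a byte[] buffer. */
    static byte[] readAll(Path file) throws IOException {
        try (InputStream in = new FileInputStream(file.toFile());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[BUFFER_SIZE];
            int n;
            // Each read() call fetches up to BUFFER_SIZE bytes at once.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, created only for this demonstration.
        Path tmp = Files.createTempFile("sample", ".bin");
        Files.write(tmp, new byte[20_000]);
        System.out.println(readAll(tmp).length + " bytes read");
        Files.delete(tmp);
    }
}
```

With a loop like this already in place, adding a BufferedInputStream on top would mostly add an extra copy rather than save system calls.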

BalusC answered Oct 17 '22 00:10



I very much doubt whether that will make any difference.

Your fundamental problem is the hundreds of thousands of tiny files. Reading those is going to make the disk thrash and take forever no matter how you do it; you'll spend 99.9% of the time waiting on mechanical movement inside the hard disk.

There are two ways to fix this:

  • Save your data on an SSD; they have much lower latency (orders of magnitude less than a spinning disk).
  • Rearrange your data into a few large files and read those sequentially.
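One possible way to do the second option is to concatenate the small files into a single archive of length-prefixed records and read it back in one sequential pass. This is only a sketch: the `PackedFiles` class and the `[int length][bytes]` record layout are assumptions made for the example, and a real format would probably also store the file names.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class PackedFiles {
    /** Writes every small file into one archive as [int length][bytes]. */
    static void pack(List<Path> smallFiles, Path archive) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(archive)))) {
            for (Path file : smallFiles) {
                byte[] data = Files.readAllBytes(file);
                out.writeInt(data.length);
                out.write(data);
            }
        }
    }

    /** Reads all records back in a single sequential pass over the archive. */
    static List<byte[]> unpack(Path archive) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(archive)))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfArchive) {
                    break; // no further length prefix: end of archive
                }
                byte[] data = new byte[length];
                in.readFully(data); // throws EOFException if a record is truncated
                records.add(data);
            }
        }
        return records;
    }
}
```

The one-off packing step still has to touch every small file, so this pays off only when the same data is read more than once.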
Michael Borgwardt answered Oct 17 '22 00:10



That depends on how you're reading the data. If you're reading from the FileInputStream in a very inefficient way (for example, calling read() byte-by-byte), then using a BufferedInputStream could improve things dramatically. But if you're already using a reasonable-sized buffer with FileInputStream, switching to a BufferedInputStream won't matter.
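To make the byte-by-byte case concrete, the sketch below sums the bytes of a file using single-byte read() calls, once on a bare FileInputStream and once through a BufferedInputStream. The class and method names are invented for illustration. On the bare stream every read() goes to the operating system, while the buffered variant serves most calls from its in-memory buffer.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

final class ByteByByte {
    /** Sums all bytes using single-byte read() calls. */
    static long sum(InputStream in) throws IOException {
        long total = 0;
        int b;
        while ((b = in.read()) != -1) {
            total += b; // read() returns an unsigned value in 0..255
        }
        return total;
    }

    /** Inefficient: each read() on the bare stream is a separate OS-level read. */
    static long sumUnbuffered(Path file) throws IOException {
        try (InputStream in = new FileInputStream(file.toFile())) {
            return sum(in);
        }
    }

    /** Same loop, but read() is served from the wrapper's internal buffer. */
    static long sumBuffered(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file.toFile()))) {
            return sum(in);
        }
    }
}
```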

Since you're talking about a large number of very small files, there's a strong possibility that a lot of the delay is due to directory operations (open, close), not the actual reading of bytes from the files.
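One way to check this guess is to time the open and close calls separately from the read loop. The following is a rough measurement sketch, not a proper benchmark: `OpenVsRead` and `Timing` are hypothetical names, and the numbers will depend heavily on whether the files are already in the operating system's cache.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class OpenVsRead {
    /** Accumulated time spent opening/closing versus reading, plus bytes read. */
    static final class Timing {
        long openCloseNanos;
        long readNanos;
        long bytesRead;
    }

    static Timing measure(List<Path> files) throws IOException {
        Timing t = new Timing();
        byte[] buffer = new byte[8 * 1024];
        for (Path file : files) {
            long beforeOpen = System.nanoTime();
            InputStream in = new FileInputStream(file.toFile());
            long afterOpen = System.nanoTime();
            try {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    t.bytesRead += n;
                }
            } finally {
                long beforeClose = System.nanoTime();
                in.close();
                long afterClose = System.nanoTime();
                t.readNanos += beforeClose - afterOpen;
                t.openCloseNanos += (afterOpen - beforeOpen) + (afterClose - beforeClose);
            }
        }
        return t;
    }

    public static void main(String[] args) throws IOException {
        // File paths are taken from the command line.
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        Timing t = measure(files);
        System.out.println("open/close ns: " + t.openCloseNanos
                + ", read ns: " + t.readNanos + ", bytes: " + t.bytesRead);
    }
}
```

If the open/close total dominates, a bigger buffer will not help much, and reducing the number of files is the more promising direction.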

David Gelhar answered Oct 17 '22 00:10
