
File Streaming in Java

I'm currently developing a 3D graphics application using JOGL (Java OpenGL binding). In brief, I have a huge landscape binary file. Due to its size, I have to stream terrain chunks at run time, so random access is an explicit concern. I have already finished the first (and dirty :)) implementation (perhaps it is multi-threaded), where I'm using a foolish approach... Here is its initialization:

dataInputStream = new DataInputStream(new BufferedInputStream(fileInputStream, 4 * 1024));
dataInputStream.mark(dataInputStream.available()); // mark the very beginning so reset() can return here

And when I need to read (stream) special chunk (I already know its "offset" in the file) I'm performing the following (shame on me :)):

dataInputStream.reset();            // jump back to the mark at the start of the file
dataInputStream.skipBytes(offset);  // then walk forward to the chunk's offset
dataInputStream.read(whatever I need...);

Since I had little experience, that was the first thing I could think of :) So far I have read 3 useful and quite interesting articles (I suggest you read them if you are interested in this topic):

  1. Byte Buffers and Non-Heap Memory - Mr. Gregory seems to be well-versed in Java NIO.

  2. Java tip: How to read files quickly [http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly] - That's an interesting benchmark.

  3. Articles: Tuning Java I/O Performance [http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/] - Simple Sun recommendations, but please scroll down and have a look at the "Random Access" section there; they show a simple implementation of RandomAccessFile (RAF) with a self-buffering improvement (a rough sketch of that idea follows this list).
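
For reference, here is a minimal sketch of that self-buffering idea; the class name, window size and the choice of overridden methods are my own simplification, not the article's actual code:

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of a self-buffered RandomAccessFile: keep a small in-memory window of
// the file and refill it only when a read falls outside it, so most seek()
// calls never touch the OS. Names and sizes are illustrative.
public class BufferedRandomAccessFile extends RandomAccessFile {

    private final byte[] buffer = new byte[4 * 1024]; // one buffered window
    private long bufferStart = 0;   // file offset of buffer[0]
    private int  bufferLength = 0;  // valid bytes currently in the buffer
    private long position = 0;      // logical read position in the file

    public BufferedRandomAccessFile(String name, String mode) throws IOException {
        super(name, mode);
    }

    @Override
    public void seek(long pos) {
        // Just remember the position; no system call until a read actually
        // falls outside the buffered window.
        position = pos;
    }

    @Override
    public int read() throws IOException {
        if (position < bufferStart || position >= bufferStart + bufferLength) {
            // Refill the window starting at the requested position.
            super.seek(position);
            bufferStart = position;
            bufferLength = super.read(buffer, 0, buffer.length);
            if (bufferLength <= 0) {
                return -1; // end of file
            }
        }
        int b = buffer[(int) (position - bufferStart)] & 0xFF;
        position++;
        return b;
    }
}

A complete version would also buffer read(byte[], int, int) and override getFilePointer(); the sketch only shows the core trick of deferring the seek.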

Mr. Gregory provides several *.java files at the end of his article. One of them benchmarks FileChannel + ByteBuffer + Mapping (FBM) against RAF. He says that he noticed a 4x speedup when using FBM compared to RAF. I ran this benchmark under the following conditions:

  1. The offset (i.e. the place of access) is generated randomly (within the file scope, i.e. 0 to file.length());
  2. File size is 220 MB;
  3. 1,000,000 accesses (75% reads and 25% writes).

The results were stunning:

~28 sec for RAF!
~0.2 sec for FBM!

However, his implementation of RAF in this benchmark doesn't have self-buffering (the 3rd article describes one), so I guess it is the RandomAccessFile.seek call that hurts performance so badly.

OK, now after all those things I've learnt, there is 1 question and 1 dilemma :)

Question: When we map a file using FileChannel.map, does Java copy the whole file contents into the MappedByteBuffer? Or does it just emulate it? If it copies, then the FBM approach is not suitable for my situation, is it?

Dilemma: It depends on your answer to the question...

  1. If mapping copies the file, then it seems like I have only 2 possible solutions: RAF + self-buffering (the one from the 3rd article) or making use of position in FileChannel (not with mapping; see the sketch after this list)... Which one would be better?

  2. If mapping doesn't copy the file, then I have 3 options: the two previous ones and FBM itself.
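
For illustration, here is a minimal sketch of the "position in FileChannel" option from item 1 above, using the channel's positional read; the chunk size, class name and file handling are my placeholders, not taken from the question:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of reading one terrain chunk at an arbitrary offset with
// FileChannel.read(ByteBuffer, position), i.e. without mapping.
public class ChannelChunkReader {

    private static final int CHUNK_SIZE = 64 * 1024; // illustrative chunk size

    private final FileChannel channel;

    public ChannelChunkReader(File terrainFile) throws IOException {
        this.channel = new FileInputStream(terrainFile).getChannel();
    }

    /** Reads one chunk starting at the given file offset. */
    public ByteBuffer readChunk(long offset) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocate(CHUNK_SIZE);
        while (chunk.hasRemaining()) {
            // Positional read: it does not move the channel's own position,
            // so several threads can read different chunks concurrently.
            int n = channel.read(chunk, offset + chunk.position());
            if (n < 0) {
                break; // hit end of file before the chunk was full
            }
        }
        chunk.flip();
        return chunk;
    }
}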

Edit: Here is one more question. Some of you here say that mapping doesn't copy the file into the MappedByteBuffer. OK then, why can't I map a 1 GB file? I'm getting a "failed to map" message...

P. S. I would like to receive a thorough answer with advice, since I'm not able to find consistent information on this topic on the internet.

Thanks :)

asked Jan 18 '11 by Alexander Shukaev


3 Answers

No, the data is not buffered. A MappedByteBuffer references the data using a pointer. In other words, the data is not copied; it is simply mapped into physical memory. See the API docs if you haven't already.

A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor. Once present, this correlation between the file and the memory space permits applications to treat the mapped portion as if it were primary memory.

Source: Wikipedia

If you are going to be reading data quite frequently, it is a good idea to at least cache some of it.
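
To make this concrete, here is a minimal sketch of reading chunks through a mapping; the class name, the whole-file mapping and the chunk-copying helper are my own assumptions, not part of the question's code:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of reading chunks through a memory mapping. The mapping itself does
// not copy the file; the OS pages the touched parts in on demand.
public class MappedChunkReader {

    private final MappedByteBuffer map;

    public MappedChunkReader(String terrainFile) throws IOException {
        FileChannel channel = new FileInputStream(terrainFile).getChannel();
        // Map the whole file read-only. For files too large to map in one go,
        // you would map smaller regions instead.
        map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    /** Copies one chunk out of the mapping into a plain byte array. */
    public byte[] readChunk(int offset, int length) {
        byte[] chunk = new byte[length];
        // duplicate() gives an independent position, so readers do not
        // interfere with each other.
        ByteBuffer view = map.duplicate();
        view.position(offset);
        view.get(chunk);
        return chunk;
    }
}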

answered by someguy


For a 220 MB file I would memory-map the whole thing into virtual memory. The reason FBM is so fast is that it doesn't actually read the data into memory; it just makes it available.

Note: when you run the test you need to compare like for like, i.e. when the file is in the OS cache it will be much faster no matter how you do it. You need to repeat the test multiple times to get a reproducible result.
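
A rough sketch of what "repeat the test" could look like, assuming a placeholder AccessMethod interface standing in for whichever access strategy (RAF, FileChannel, FBM) is being timed:

// Repeat a timing run so the effect of the OS file cache is visible: the first
// pass is usually the slowest, later passes level off once the file is cached.
// AccessMethod is a placeholder, not a real API.
public class RepeatedBenchmark {

    interface AccessMethod {
        void run() throws Exception;
    }

    static void benchmark(String label, AccessMethod method, int passes) throws Exception {
        for (int pass = 1; pass <= passes; pass++) {
            long start = System.nanoTime();
            method.run();
            long millis = (System.nanoTime() - start) / 1000000L;
            System.out.println(label + " pass " + pass + ": " + millis + " ms");
        }
    }
}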

answered by Peter Lawrey


Have you noticed that if you run a program, then close it, then run it again, it starts up much faster the second time? This happens because the OS has cached the parts of the files that were accessed in the first run and doesn't need to access the disk for them. Memory-mapping a file essentially allows a program access to these buffers, thus minimizing the copies made when reading it. Note that memory-mapping a file does not cause it to be read whole into memory; the bits and pieces that you read are read from disk on demand. If the OS determines that there is low memory, it may decide to free up some parts of the mapped file from memory and leave them on disk.

Edit: What you want is FileInputStream.getChannel().map(), then adapt that to an InputStream, then connect that to the DataInputStream.
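
A minimal sketch of that adaptation, assuming a hand-written ByteBufferInputStream adapter (it is not a JDK class) and mapping only the region of interest:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map a region of the file, wrap the buffer in a small InputStream adapter,
// then hand that to DataInputStream.
public class MappedDataInput {

    /** Minimal InputStream view over a ByteBuffer. */
    static class ByteBufferInputStream extends InputStream {
        private final ByteBuffer buffer;

        ByteBufferInputStream(ByteBuffer buffer) {
            this.buffer = buffer;
        }

        @Override
        public int read() {
            return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
        }

        @Override
        public int read(byte[] dst, int off, int len) {
            if (!buffer.hasRemaining()) {
                return -1;
            }
            int n = Math.min(len, buffer.remaining());
            buffer.get(dst, off, n);
            return n;
        }
    }

    public static DataInputStream open(String file, long offset, long length) throws IOException {
        FileChannel channel = new FileInputStream(file).getChannel();
        // Map just the region of interest (e.g. one terrain chunk).
        MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
        return new DataInputStream(new ByteBufferInputStream(region));
    }
}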

answered by Tassos Bassoukos