I'm working on a reader/writer for DNG/TIFF files. Since there are several options for working with files in general (FileInputStream, FileChannel, RandomAccessFile), I'm wondering which strategy best fits my needs.
A DNG/TIFF file is a composition of a few small meta-data blocks scattered throughout the file and one or more big contiguous blocks of raw image data.
The overall file size ranges from 15 MiB (compressed 14-bit raw data) up to about 100 MiB (uncompressed float data). The number of files to process is 50-400.
There are two usage patterns: reading only the meta-data, and reading the meta-data plus the big raw image data.
I'm currently using a FileChannel and performing a map() to obtain a MappedByteBuffer covering the whole file. This seems quite wasteful if I'm just interested in reading the meta-data. Another problem is freeing the mapped memory: when I pass slices of the mapped buffer around for parsing etc., the underlying MappedByteBuffer won't be collected.
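A minimal sketch of that retention problem (the file and the slice bounds are made up; the retention itself is standard NIO semantics):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SliceRetentionDemo {
    static ByteBuffer metaDataSlice(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer whole = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
            // The slice shares the backing mapping: as long as any slice is
            // reachable, the entire mapping stays alive and cannot be
            // unmapped, even though only a few hundred bytes are needed.
            whole.position(8);
            whole.limit(8 + 200);
            return whole.slice();
        }
    }
}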
I now decided to copy smaller chunks via FileChannel's various read() methods and to map only the big raw-data regions. The downside is that reading a single value is quite cumbersome, because there's no readShort() and the like:
short readShort(long offset) throws IOException, InterruptedException {
    return read(offset, Short.BYTES).getShort();
}

ByteBuffer read(long offset, long byteCount) throws IOException, InterruptedException {
    // Allocate a temporary buffer in the file's byte order and fill it
    // completely from the given offset.
    ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(byteCount));
    buffer.order(GenericTiffFileReader.this.byteOrder);
    GenericTiffFileReader.this.readInto(buffer, offset);
    return buffer;
}

private void readInto(ByteBuffer buffer, long startOffset)
        throws IOException, InterruptedException {
    long offset = startOffset;
    while (buffer.hasRemaining()) {
        int bytesRead = this.channel.read(buffer, offset);
        switch (bytesRead) {
        case 0:
            // No progress; back off briefly and retry.
            Thread.sleep(10);
            break;
        case -1:
            throw new EOFException("unexpected end of file");
        default:
            offset += bytesRead;
        }
    }
    buffer.flip();
}
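For illustration, here is a hypothetical caller built on these helpers (the method is made up; the offsets follow the TIFF 6.0 header layout, where bytes 2-3 hold the magic number 42 and bytes 4-7 the offset of the first IFD):

long readFirstIfdOffset() throws IOException, InterruptedException {
    short magic = readShort(2); // 42 identifies a classic TIFF/DNG file
    if (magic != 42)
        throw new IOException("not a TIFF file, magic = " + magic);
    // The first-IFD offset is an unsigned 32-bit value at byte 4.
    return Integer.toUnsignedLong(read(4, Integer.BYTES).getInt());
}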
RandomAccessFile provides useful methods like readShort() or readFully(), but it cannot handle little-endian byte order.
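One workaround is to read in RandomAccessFile's fixed big-endian order and swap the bytes afterwards, which is what method 3 of the benchmarks below does for int values. A minimal sketch for a little-endian short:

static short readLittleEndianShort(RandomAccessFile raf, long offset) throws IOException {
    raf.seek(offset);
    // readShort() always decodes big-endian; reverseBytes() yields the
    // little-endian interpretation of the same two bytes.
    return Short.reverseBytes(raf.readShort());
}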
So, is there an idiomatic way to handle scattered reads of single bytes and huge blocks? Is memory-mapping an entire 100 MiB file to just read a few hundred bytes wasteful or slow?
OK, I finally did some rough benchmarks. Before the first run I dropped the page cache:

echo 3 > /proc/sys/vm/drop_caches

The sum of the files' sizes exceeded my installed system memory.
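The per-method timing loop isn't shown below; a minimal driver sketch of what it could look like (the interface, the file list and the name Benchmarks are assumptions, not part of the original measurement code):

interface BenchMethod {
    long run(Path file, long dummyUsage) throws IOException;
}

static void bench(List<Path> files, BenchMethod method) throws IOException {
    long dummy = 0;
    long start = System.nanoTime();
    for (Path file : files)
        dummy = method.run(file, dummy);   // e.g. method = Benchmarks::method1
    // Print elapsed time; 'dummy' is carried along so the reads
    // cannot be optimized away.
    System.out.printf("%d ms (dummy=%d)%n",
            (System.nanoTime() - start) / 1_000_000, dummy);
}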
Method 1, FileChannel and temporary ByteBuffers:
private static long method1(Path file, long dummyUsage) throws IOException, Error {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        for (int i = 0; i < 1000; i++) {
            ByteBuffer dst = ByteBuffer.allocate(8);
            if (channel.position(i * 10000).read(dst) != dst.capacity())
                throw new Error("partial read");
            dst.flip();
            dummyUsage += dst.order(ByteOrder.LITTLE_ENDIAN).getInt();
            dummyUsage += dst.order(ByteOrder.BIG_ENDIAN).getInt();
        }
    }
    return dummyUsage;
}
Results:
1. 3422 ms
2. 56 ms
3. 24 ms
4. 24 ms
5. 27 ms
6. 25 ms
7. 23 ms
8. 23 ms
Method 2, MappedByteBuffer covering the whole file:
private static long method2(Path file, long dummyUsage) throws IOException {
    final MappedByteBuffer buffer;
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
    }
    for (int i = 0; i < 1000; i++) {
        dummyUsage += buffer.order(ByteOrder.LITTLE_ENDIAN).getInt(i * 10000);
        dummyUsage += buffer.order(ByteOrder.BIG_ENDIAN).getInt(i * 10000 + 4);
    }
    return dummyUsage;
}
Results:
1. 749 ms
2. 21 ms
3. 17 ms
4. 16 ms
5. 18 ms
6. 13 ms
7. 15 ms
8. 17 ms
Method 3, RandomAccessFile:
private static long method3(Path file, long dummyUsage) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        for (int i = 0; i < 1000; i++) {
            raf.seek(i * 10000);
            dummyUsage += Integer.reverseBytes(raf.readInt());
            raf.seek(i * 10000 + 4);
            dummyUsage += raf.readInt();
        }
    }
    return dummyUsage;
}
Results:
1. 3479 ms
2. 104 ms
3. 81 ms
4. 84 ms
5. 78 ms
6. 81 ms
7. 81 ms
8. 81 ms
Conclusion: The MappedByteBuffer method occupies more page-cache memory (340 MB instead of 140 MB) but performs significantly better on the first and on all following runs, and it seems to have the lowest overhead. As a bonus, it provides a really comfortable interface for byte order, scattered small values and huge data blocks. RandomAccessFile performs worst.
To answer my own question: a MappedByteBuffer covering the whole file seems to be the idiomatic and fastest way to handle random access to big files without being wasteful regarding memory.
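Put together, a minimal sketch of that approach (class and method names are made up, and the byte order is assumed rather than read from the header):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedTiffReader {
    private final MappedByteBuffer buffer;

    MappedTiffReader(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // One mapping of the whole file; the OS pages data in lazily,
            // so only the regions actually touched consume page cache.
            buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
        }
        buffer.order(ByteOrder.LITTLE_ENDIAN); // assumption; real code reads the II/MM mark
    }

    // Scattered small reads: absolute getters, no temporary buffers.
    short readShort(int offset) {
        return buffer.getShort(offset);
    }

    // Huge block: a zero-copy view of the raw image data.
    ByteBuffer rawData(int offset, int length) {
        ByteBuffer view = buffer.duplicate();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }
}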