Read scattered data from multiple files in Java

I'm working on a reader/writer for DNG/TIFF files. As there are several options to work with files in general (FileInputStream, FileChannel, RandomAccessFile), I'm wondering which strategy would fit my needs.

A DNG/TIFF file is a composition of:

  • some (5-20) small blocks (several tens to hundred bytes)
  • very few (1-3) big continuous blocks of image data (up to 100 MiB)
  • several (maybe 20-50) very small blocks (4-16 bytes)

The overall file size ranges from 15 MiB (compressed 14 bit raw data) up to about 100 MiB (uncompressed float data). The number of files to process is 50-400.

There are two usage patterns:

  1. Read all meta-data from all files (everything except the image data)
  2. Read all image data from all files

I'm currently using a FileChannel and calling map() to obtain a MappedByteBuffer covering the whole file. This seems quite wasteful if I'm just interested in reading the meta-data. Another problem is freeing the mapped memory: when I pass slices of the mapped buffer around for parsing etc., the underlying MappedByteBuffer won't be collected.
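
For reference, map() does not have to cover the whole file; it also accepts an offset and a length, so only the regions of interest could be mapped. A minimal sketch (the offset and length here are made-up placeholders, not real TIFF values):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class RegionMapping {

    // Maps only [offset, offset + length) of the file instead of the whole file.
    static MappedByteBuffer mapRegion(Path file, long offset, long length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid even after the channel is closed.
            return channel.map(MapMode.READ_ONLY, offset, length);
        }
    }
}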

I have now decided to copy smaller chunks via the FileChannel's read() methods and to map only the big raw-data regions. The downside is that reading a single value becomes quite cumbersome, because there is no readShort() and the like:

short readShort(long offset) throws IOException, InterruptedException {
    return read(offset, Short.BYTES).getShort();
}

ByteBuffer read(long offset, long byteCount) throws IOException, InterruptedException {
    ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(byteCount));
    buffer.order(GenericTiffFileReader.this.byteOrder);
    GenericTiffFileReader.this.readInto(buffer, offset);
    return buffer;
}

private void readInto(ByteBuffer buffer, long startOffset)
        throws IOException, InterruptedException {

    long offset = startOffset;
    while (buffer.hasRemaining()) {
        int bytesRead = this.channel.read(buffer, offset);
        switch (bytesRead) {
        case 0:
            // nothing was read; back off briefly and retry
            Thread.sleep(10);
            break;
        case -1:
            throw new EOFException("unexpected end of file");
        default:
            offset += bytesRead;
        }
    }
    // prepare the buffer for reading by the caller
    buffer.flip();
}

RandomAccessFile provides convenient methods like readShort() or readFully(), but it cannot handle little-endian byte order.
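
I suppose the byte order could be worked around by swapping bytes after each read; a rough sketch of such a wrapper (class and method names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;

final class LittleEndianRaf {

    private final RandomAccessFile raf;

    LittleEndianRaf(RandomAccessFile raf) {
        this.raf = raf;
    }

    // readShort()/readInt() always decode big-endian, so swap the bytes afterwards.
    short readShortLE(long offset) throws IOException {
        raf.seek(offset);
        return Short.reverseBytes(raf.readShort());
    }

    int readIntLE(long offset) throws IOException {
        raf.seek(offset);
        return Integer.reverseBytes(raf.readInt());
    }
}

This works, but it costs a seek per value and still doesn't feel idiomatic.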

So, is there an idiomatic way to handle scattered reads of single bytes and huge blocks? Is memory-mapping an entire 100 MiB file to just read a few hundred bytes wasteful or slow?

1 Answer

Ok, I finally did some rough benchmarks:

  1. Flush all read caches: echo 3 > /proc/sys/vm/drop_caches
  2. Repeat 8 times: read 8 bytes at 1000 different offsets from each file (about 20 files, from 20 MiB up to 1 GiB).

The sum of the files' sizes exceeded my installed system memory.
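
A minimal sketch of a driver around the three read methods below (not the exact harness I used; the file list and output format are placeholders, and the cache flush has to happen outside the JVM before the first run):

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class BenchmarkDriver {

    // Shape shared by the three read methods below.
    interface ReadMethod {
        long read(Path file, long dummyUsage) throws IOException;
    }

    static void run(List<Path> files, ReadMethod method) throws IOException {
        long dummy = 0;
        for (int round = 1; round <= 8; round++) {
            long start = System.nanoTime();
            for (Path file : files) {
                dummy = method.read(file, dummy); // 1000 x 8 bytes per file
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(round + ". " + millis + " ms");
        }
        System.out.println("checksum: " + dummy); // keep the JIT from eliding the reads
    }
}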

Method 1, FileChannel and temporary ByteBuffers:

private static long method1(Path file, long dummyUsage) throws IOException, Error {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {

        for (int i = 0; i < 1000; i++) {
            ByteBuffer dst = ByteBuffer.allocate(8);

            if (channel.position(i * 10000).read(dst) != dst.capacity())
                throw new Error("partial read");
            dst.flip();
            dummyUsage += dst.order(ByteOrder.LITTLE_ENDIAN).getInt();
            dummyUsage += dst.order(ByteOrder.BIG_ENDIAN).getInt();
        }
    }
    return dummyUsage;
}

Results:

1. 3422 ms
2. 56 ms
3. 24 ms
4. 24 ms
5. 27 ms
6. 25 ms
7. 23 ms
8. 23 ms

Method 2, MappedByteBuffer covering the whole file:

private static long method2(Path file, long dummyUsage) throws IOException {

    final MappedByteBuffer buffer;
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
    }
    for (int i = 0; i < 1000; i++) {
        dummyUsage += buffer.order(ByteOrder.LITTLE_ENDIAN).getInt(i * 10000);
        dummyUsage += buffer.order(ByteOrder.BIG_ENDIAN).getInt(i * 10000 + 4);
    }
    return dummyUsage;
}

Results:

1. 749 ms
2. 21 ms
3. 17 ms
4. 16 ms
5. 18 ms
6. 13 ms
7. 15 ms
8. 17 ms

Method 3, RandomAccessFile:

private static long method3(Path file, long dummyUsage) throws IOException {

    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        for (int i = 0; i < 1000; i++) {

            raf.seek(i * 10000);
            dummyUsage += Integer.reverseBytes(raf.readInt());
            raf.seek(i * 10000 + 4);
            dummyUsage += raf.readInt();
        }
    }
    return dummyUsage;
}

Results:

1. 3479 ms
2. 104 ms
3. 81 ms
4. 84 ms
5. 78 ms
6. 81 ms
7. 81 ms
8. 81 ms

Conclusion: The MappedByteBuffer method occupies more page cache memory (340 MB instead of 140 MB), but it performs significantly better on the first run and on all following runs, and it seems to have the lowest overhead. As a bonus, this method provides a really comfortable interface for byte order, scattered small reads and huge data blocks alike. RandomAccessFile performs worst.

To answer my own question: a MappedByteBuffer covering the whole file seems to be the idiomatic and fastest way to handle random access to big files without being wasteful with memory.
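
To make that concrete, here is a minimal sketch of what such access could look like (the offsets and the slice region are placeholders, not real TIFF/DNG layout; a real reader would take them from the IFDs):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedTiffAccess {

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);

        final MappedByteBuffer map;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            map = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
        }
        map.order(ByteOrder.LITTLE_ENDIAN); // set the byte order once

        // Scattered small reads via absolute getters (placeholder offsets).
        short version = map.getShort(2);
        long firstIfdOffset = Integer.toUnsignedLong(map.getInt(4));

        // A big contiguous block as a zero-copy view (placeholder region; Java 13+).
        // Note: a slice resets the byte order to big-endian.
        ByteBuffer imageData = map.slice(16, 1 << 20);

        System.out.println(version + " " + firstIfdOffset + " " + imageData.remaining());
    }
}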
