
Java Array Bulk Flush on Disk

Tags:

java

arrays

I have two arrays (int and long), each containing millions of entries. So far I have been writing them with a DataOutputStream over a large buffer, which keeps the disk I/O cost low (NIO performs about the same here, since the buffer is huge). Specifically, I use

DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"),1024*1024*100));

for (int i = 0; i < 220000000; i++) {
    long l = longarray[i];
    dos.writeLong(l);
}
dos.close();

But it takes several minutes (more than 5) to do that. What I actually want is a bulk flush (some sort of main-memory-to-disk memory map). For that, I found a promising approach here and here, but I can't work out how to use it in my code. Can anybody help me with that, or suggest another way to do this nicely?

asked Feb 25 '26 by Arpssss

2 Answers

On my machine, a 3.8 GHz i7 with an SSD,

DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("abc.txt"), 32 * 1024));

long start = System.nanoTime();
final int count = 220000000;
for (int i = 0; i < count; i++) {
    long l = i;
    dos.writeLong(l);
}
dos.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
        time / 1e9, count);

prints

Took 11.706 seconds to write 220,000,000 longs

Using memory mapped files

final int count = 220000000;

final FileChannel channel = new RandomAccessFile("abc.txt", "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, count * 8);
mbb.order(ByteOrder.nativeOrder());

long start = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = i;
    mbb.putLong(l);
}
channel.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to write %,d longs%n",
        time / 1e9, count);

// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb).cleaner().clean();

final FileChannel channel2 = new RandomAccessFile("abc.txt", "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == count * 8;
long start2 = System.nanoTime();
for (int i = 0; i < count; i++) {
    long l = mbb2.getLong();
    if (i != l)
        throw new AssertionError("Expected "+i+" but got "+l);
}
channel2.close();
long time2 = System.nanoTime() - start2;
System.out.printf("Took %.3f seconds to read %,d longs%n",
        time2 / 1e9, count);

// Only works on Sun/HotSpot/OpenJDK to deallocate buffer.
((DirectBuffer) mbb2).cleaner().clean();

prints on my 3.8 GHz i7.

Took 0.568 seconds to write 220,000,000 longs

on a slower machine prints

Took 1.180 seconds to write 220,000,000 longs
Took 0.990 seconds to read 220,000,000 longs

Is there any other way that avoids creating that buffer? I already have the array in main memory, and I can't allocate more than 500 MB on top of it.

This uses less than 1 KB of heap. If you look at how much memory is used before and after this call, you will normally see no increase at all.
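For an array that is already on the heap, a LongBuffer view of the mapping lets you copy it to disk in one bulk put, without allocating another large buffer. A minimal sketch (the file name and array size are placeholders, not from the original code):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class BulkWrite {
    public static void main(String[] args) throws Exception {
        long[] longarray = new long[1_000_000];          // stand-in for the real data
        for (int i = 0; i < longarray.length; i++) longarray[i] = i;

        File file = File.createTempFile("abc", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE,
                    0, (long) longarray.length * 8);
            mbb.order(ByteOrder.nativeOrder());
            // One bulk copy from the on-heap array into the mapping; the OS
            // writes the pages back to disk, so no second large buffer is needed.
            mbb.asLongBuffer().put(longarray);
        }
        System.out.println(file.length()); // prints 8000000
    }
}
```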

Another thing: does MappedByteBuffer also give efficient loading?

In my experience, using a memory mapped file is by far the fastest because you reduce the number of system calls and copies into memory.

Because in some article I found that read(buffer) gives better loading performance. (I checked it; it really is faster, reading a 220-million-entry int/float array in 5 seconds.)

I would like to read that article because I have never seen that.

Another issue: readLong gives an error while reading from your code's output file.

Part of the performance improvement comes from storing the values in native byte order. writeLong/readLong always uses big-endian format, which is much slower on Intel/AMD systems, since those are natively little-endian.

You can switch the byte order to big-endian, which will slow it down, or you can use native ordering (DataInputStream/DataOutputStream only support big-endian).
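The difference is easy to see in a small sketch: ByteBuffer defaults to big-endian, the same layout DataOutputStream.writeLong always produces, but it can be switched to the platform's native order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    public static void main(String[] args) {
        // Default order is BIG_ENDIAN, matching DataOutputStream.writeLong:
        // the most significant byte comes first.
        ByteBuffer big = ByteBuffer.allocate(8);
        big.putLong(0x0102030405060708L);
        System.out.println(big.get(0)); // prints 1 (high byte first)

        // Native order on x86/ARM is little-endian: the low byte comes first,
        // so the JVM can store the long without a byte swap.
        ByteBuffer nat = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        nat.putLong(0x0102030405060708L);
        System.out.println(nat.get(0)); // 8 on a little-endian machine
    }
}
```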

answered Feb 28 '26 by Peter Lawrey


I am running it on a server with 16 GB memory and a 2.13 GHz [CPU]

I doubt the problem has anything to do with your Java code.

Your file system appears to be extraordinarily slow (at least ten times slower than what one would expect from a local disk).

I would do two things:

  1. Double check that you are actually writing to a local disk, and not to a network share. Bear in mind that in some environments home directories are NFS mounts.
  2. Ask your sysadmins to take a look at the machine to find out why the disk is so slow. If I were in their shoes, I'd start by checking the logs and running some benchmarks (e.g. using Bonnie++).
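As a quick sanity check before involving the sysadmins, a sequential-write micro-benchmark gives a rough MB/s figure to compare against what the disk should manage. A sketch (file location and size are placeholders; point the temp file at the suspect disk):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class DiskSpeedCheck {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("speed", ".dat"); // place on the disk under test
        byte[] block = new byte[1024 * 1024];             // 1 MB block of zeros
        int blocks = 64;                                  // 64 MB total

        long start = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            for (int i = 0; i < blocks; i++) out.write(block);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        // Note: the OS page cache makes this optimistic; use a much larger
        // size (or fsync) for a truer picture of sustained disk throughput.
        System.out.printf("Wrote %d MB in %.2f s (%.0f MB/s)%n",
                blocks, seconds, blocks / seconds);
        file.delete();
    }
}
```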
answered Feb 28 '26 by NPE