For a high-performance blocked Bloom filter, I would like to align data to cache lines. (I know it's easier to do such tricks in C, but I would like to use Java.)
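I do have a solution, but I'm not sure if it's correct, or if there is a better way. My solution tries to find the start of the cache line as follows: for each candidate offset o from 0 to 63 (I assume a cache line length of 64 bytes), a second thread repeatedly copies data[o] to data[o + 8], while the main thread writes a value to data[o] and waits until it shows up in data[o + 8].
Then I measure how fast this was, basically how many increments succeed in a loop of 1 million iterations (in each thread). My reasoning is that the exchange is slower if the two bytes are in different cache lines.
Here is my code: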
import java.util.Arrays;
import java.util.concurrent.Semaphore;

public static void main(String... args) {
    for (int i = 0; i < 20; i++) {
        int size = (int) (1000 + Math.random() * 1000);
        byte[] data = new byte[size];
        int cacheLineOffset = getCacheLineOffset(data);
        System.out.println("offset: " + cacheLineOffset);
    }
}

// Retries the measurement with an increasing number of test rounds
// until one of them gives a clear result.
private static int getCacheLineOffset(byte[] data) {
    for (int i = 0; i < 10; i++) {
        int x = tryGetCacheLineOffset(data, i + 3);
        if (x != -1) {
            return x;
        }
    }
    System.out.println("Cache line start not found");
    return 0;
}

private static int tryGetCacheLineOffset(byte[] data, int testCount) {
    // assume synchronization between two threads is faster(?)
    // if each thread works on the same cache line
    int[] counters = new int[64];
    int testOffset = 8;
    for (int test = 0; test < testCount; test++) {
        for (int offset = 0; offset < 64; offset++) {
            final int o = offset;
            final Semaphore sema = new Semaphore(0);
            // helper thread: copies data[o] to data[o + testOffset]
            Thread t = new Thread() {
                public void run() {
                    try {
                        sema.acquire();
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                    for (int i = 0; i < 1000000; i++) {
                        data[o + testOffset] = data[o];
                    }
                }
            };
            t.start();
            sema.release();
            // main thread: counts how often the value makes the round trip
            data[o] = 1;
            int counter = 0;
            byte waitfor = 1;
            for (int i = 0; i < 1000000; i++) {
                byte x = data[o + testOffset];
                if (x == waitfor) {
                    data[o]++;
                    counter++;
                    waitfor++;
                }
            }
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            counters[offset] += counter;
        }
    }
    // reset the bytes that were used for the test
    Arrays.fill(data, 0, testOffset + 64, (byte) 0);
    int low = Integer.MAX_VALUE, high = Integer.MIN_VALUE;
    for (int i = 0; i < 64; i++) {
        // average of 3
        int avg3 = (counters[(i - 1 + 64) % 64] + counters[i] + counters[(i + 1) % 64]) / 3;
        low = Math.min(low, avg3);
        high = Math.max(high, avg3);
    }
    if (low * 1.1 > high) {
        // no significant difference between low and high
        return -1;
    }
    // offsets where data[o] and data[o + testOffset] are in different
    // cache lines should be slow; with testOffset = 8, expect exactly 8 of them
    int lowCount = 0;
    boolean[] isLow = new boolean[64];
    for (int i = 0; i < 64; i++) {
        if (counters[i] < (low + high) / 2) {
            isLow[i] = true;
            lowCount++;
        }
    }
    if (lowCount != 8) {
        // unclear
        return -1;
    }
    // the cache line starts at the first fast offset after the slow run
    for (int i = 0; i < 64; i++) {
        if (isLow[(i - 1 + 64) % 64] && !isLow[i]) {
            return i;
        }
    }
    return -1;
}
It prints (example):
offset: 16
offset: 24
offset: 0
offset: 40
offset: 40
offset: 8
offset: 24
offset: 40
...
So arrays in Java seem to be aligned to 8 bytes.
You know that the GC can move objects... so your perfectly aligned array may get misaligned later.
I'd try a ByteBuffer; I would guess that a direct one is aligned to a large boundary (possibly a page boundary).
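To make the direct-buffer suggestion concrete, here is a minimal sketch. It assumes Java 9 or later (for `ByteBuffer.alignedSlice` and `ByteBuffer.alignmentOffset`) and a 64-byte cache line; the class name `AlignedDirectBuffer` and the method `allocateAligned` are made up for this example. Instead of relying on whatever alignment `allocateDirect` happens to give, it over-allocates and asks for an aligned slice. The memory of a direct buffer lives outside the Java heap, so the GC does not relocate it.

```java
import java.nio.ByteBuffer;

final class AlignedDirectBuffer {
    // Assumption: 64-byte cache lines, as in the question.
    static final int CACHE_LINE = 64;

    /** Returns a direct buffer with `size` usable bytes that starts on a cache line boundary. */
    static ByteBuffer allocateAligned(int size) {
        // Round the size up to a whole number of cache lines, then add one
        // extra line so that an aligned slice of that length always fits.
        int rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
        ByteBuffer raw = ByteBuffer.allocateDirect(rounded + CACHE_LINE);
        // alignedSlice rounds the start up and the end down to multiples of
        // CACHE_LINE, measured on the actual memory address.
        ByteBuffer aligned = raw.alignedSlice(CACHE_LINE);
        // Expose only the requested number of bytes.
        aligned.limit(size);
        return aligned;
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocateAligned(1000);
        // alignmentOffset reports the address of the given index modulo the unit size.
        System.out.println("alignment offset: " + buf.alignmentOffset(0, CACHE_LINE));
        System.out.println("usable bytes: " + buf.remaining());
    }
}
```

Each 64-byte block of a blocked Bloom filter could then be addressed as `blockIndex * 64` within this buffer. Whether access through a ByteBuffer is fast enough compared to a plain array would have to be measured.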
Unsafe can give you the address, and with JNI you can get an array pinned.
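Once an address is known, the offset to the next cache line boundary is simple arithmetic. The sketch below shows only that step; obtaining the address (via Unsafe or JNI) is not shown, the address value in `main` is hypothetical, and `CacheLineMath` and `paddingToAlign` are names invented for this example. For a heap array the result is only valid as long as the GC has not moved the array.

```java
final class CacheLineMath {
    /**
     * Returns the number of bytes to skip from `address` to reach the next
     * multiple of `lineSize`. Assumes `lineSize` is a power of two.
     */
    static int paddingToAlign(long address, int lineSize) {
        // -address modulo lineSize, computed with a mask
        return (int) (-address & (lineSize - 1));
    }

    public static void main(String[] args) {
        long address = 0x7f3a10000028L; // hypothetical address, 40 bytes past a boundary
        System.out.println("padding: " + paddingToAlign(address, 64)); // prints "padding: 24"
    }
}
```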