Suggestions for a compression library to get byte[] as small as possible without considering CPU expense?

Correct me if I'm approaching this wrong, but I have a queue server and a bunch of Java workers that I'm running in a cluster. My queue has work units that are very small, but there are many of them. So far my benchmarks and review of the workers have shown that I get about 200 MB/second.

So I'm trying to figure out how to push more work units through my available bandwidth. Currently my CPU usage is not very high (40-50%) because the workers can process the data faster than the network can send it. I want to get more work through the queue and am willing to pay for it with expensive compression/decompression (since half of each core is idle right now).

I have tried Java LZO and gzip, but was wondering if there is anything better (even if it's more CPU-expensive)?
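For reference, this is roughly how I'm driving the JDK's Deflate at maximum compression (a sketch; deflateMax is just an illustrative name, not part of any library):

import java.util.Arrays;
import java.util.zip.Deflater;

// Sketch: compress a payload with java.util.zip.Deflater at BEST_COMPRESSION.
static byte[] deflateMax(byte[] input) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[Math.max(64, input.length)];
    int total = 0;
    while (!deflater.finished()) {
        if (total == buffer.length) {
            // grow the output buffer if the compressed data doesn't fit
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        total += deflater.deflate(buffer, total, buffer.length - total);
    }
    deflater.end();
    return Arrays.copyOf(buffer, total);
}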

Updated: the data is a byte[]. Basically the queue only takes it in that format, so I am using a ByteArrayOutputStream to write two ints and an int[] into byte[] form. The values in the int[] are all between 0 and 100 (or 1000, but the vast majority of the numbers are zeros). The lists are quite large, anywhere from 1000 to 10,000 items (again, mostly zeros; never more than 100 non-zero numbers in the int[]).
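Roughly, the packing looks like this (a sketch; pack and the length prefix are my own illustrative choices, not a fixed format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: two header ints, a length prefix, then the int[] values.
static byte[] pack(int a, int b, int[] values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(a);
    out.writeInt(b);
    out.writeInt(values.length); // so the worker knows how many values follow
    for (int v : values) {
        out.writeInt(v);
    }
    out.flush();
    return bytes.toByteArray();
}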

asked by Lostsoul

1 Answer

It sounds like using a custom compression mechanism that exploits the structure of the data could be very efficient.

Firstly, using a short[] (16-bit data type) instead of an int[] will halve (!) the amount of data sent. You can do this because the values (at most 1000) fit comfortably within short's range of -2^15 (-32768) to 2^15-1 (32767). This is ridiculously easy to implement.
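A minimal sketch of this, assuming a DataOutputStream-style packing like the one in the question (packShorts is an illustrative name):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: identical layout, but each value written as a 16-bit short.
static byte[] packShorts(int a, int b, int[] values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(a);
    out.writeInt(b);
    out.writeInt(values.length);
    for (int v : values) {
        out.writeShort(v); // writes the low 16 bits; safe for values in 0..1000
    }
    out.flush();
    return bytes.toByteArray();
}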

Secondly, you could use a scheme similar to run-length encoding: a positive number represents that value literally, while a negative number -n represents a run of n zeros. For example:

[10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0] <=> [10, 40, -3, 30, -1, 100, -4]

This is harder to implement than just substituting short for int, but will provide ~80% compression (i.e., output about 20% of the input length) even in the very worst case (1000 numbers, 100 non-zero, none of them consecutive).
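Here is a sketch of that scheme in Java (encode/decode are illustrative names; decode assumes the original length travels in the header, as in the question's format):

import java.util.ArrayList;
import java.util.List;

// Sketch: positive shorts are literal values, negative shorts encode runs of zeros.
static short[] encode(int[] values) {
    List<Short> out = new ArrayList<>();
    int i = 0;
    while (i < values.length) {
        if (values[i] == 0) {
            int run = 0;
            while (i < values.length && values[i] == 0) { run++; i++; }
            out.add((short) -run); // -run encodes "run zeros here"
        } else {
            out.add((short) values[i++]); // literal non-zero value
        }
    }
    short[] result = new short[out.size()];
    for (int j = 0; j < result.length; j++) result[j] = out.get(j);
    return result;
}

static int[] decode(short[] encoded, int originalLength) {
    int[] out = new int[originalLength]; // zero-initialized, so runs are just skips
    int pos = 0;
    for (short s : encoded) {
        if (s < 0) pos += -s;   // skip a run of |s| zeros
        else out[pos++] = s;    // literal value
    }
    return out;
}

Since your lists never exceed 10,000 items, a run of zeros always fits in a short.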

I just did some simulations to work out the compression ratios. I tested the method I described above, and the one suggested by Louis Wasserman and sbridges. Both performed very well.

Assuming the length of the array and the number of non-zero values are both uniformly distributed between their bounds, both methods save about 5400 ints (or shorts) on average, for a compressed size of about 2.5% of the original! The run-length encoding method saves about 1 additional int on average (a compressed size about 0.03% smaller), i.e. basically no difference, so you should use whichever is easiest to implement. The following are histograms of the compression ratios for 50000 random samples (they are very similar!).

[Figure: histograms of compression ratios for run-length encoding and index storage]

Summary: using shorts instead of ints and one of the compression methods, you will be able to compress the data to about 1% of its original size!

For the simulation, I used the following R script:

SIZE <- 50000

lengths <- sample(1000:10000, SIZE, replace=TRUE)  # array lengths
nonzeros <- sample(1:100, SIZE, replace=TRUE)      # non-zero counts

f.rle <- function(len, nonzero) {
  indexes <- sort(c(0, sample(1:len, nonzero, replace=FALSE)))
  steps <- diff(indexes)
  sum(steps > 1) + nonzero # one short per run of zeros, one per non-zero value
}

f.index <- function(len, nonzero) {
  nonzero * 2 # each non-zero value stored as an (index, value) pair
}

# using the [value, -1 * number of zeros,...] method
rle.comprs <- mapply(f.rle, lengths, nonzeros)
print(mean(lengths - rle.comprs)) # average number of shorts saved

rle.ratios <- rle.comprs / lengths * 100
print(mean(rle.ratios)) # average compression ratio

# using the [(index, value),...] method
index.comprs <- mapply(f.index, lengths, nonzeros)
print(mean(lengths - index.comprs)) # average number of shorts saved

index.ratios <- index.comprs / lengths * 100
print(mean(index.ratios)) # average compression ratio


par(mfrow=c(2,1))
hist(rle.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Run length encoding")
hist(index.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Store indices")
answered by huon