Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently compute the hashCode for a BitSet-like implementation of Set<Integer>

I wonder, how to efficiently compute the hashCode for a BitSet-like implementation of Set<Integer>.

The BitSet#hashCode is obviously fast to compute, rather stupid(*) and incompatible with Set#hashCode().

A fast compatible implementation could go like

int hashCode() {
    int result = 0;
    for (int i=0; i<bits.length; ++i) {
        long word = bits[i];
        result += 64 * i * Long.bitCount(word) + weightedBitCount(word);
    }
    return result;
}

if there was an efficient implementation of

int weightedBitCount(long word) { // naive implementation
    int result = 0;
    for (int i=0; i<64; ++i) {
        if ((word & (1L << i)) != 0) {
            result += i;
        }
    }
    return result;
}

In case most bits are unset, the naive implementation could be improved by testing word==0 or using Long.highestOneBit or alike, but these tricks don't help much and are detrimental in other cases.

Is there a clever trick to significantly speed it up in general?

I guess, some batch computation over multiple words could be more efficient. Computing Set#size at the same time would be a nice bonus.

A note concerning premature optimization: Yes, I know. I'm mostly curious (and it can be useful for things like Project Euler).


(*) There are many bits which gets completely ignored (they get shifted out in the multiplication).

like image 763
maaartinus Avatar asked Feb 21 '18 02:02

maaartinus


Video Answer


2 Answers

Notwithstanding hardware support (e.g., the x86 popcnt instruction), counting the bits in O(1) time is a rather well-known algorithm, generally communicated as SWAR bitcount.

However, your algorithm has a custom kernel that adds a different value based on which bit is set:

result += loop_counter_value;

Lacking a pithy algorithm for bit counting with custom kernels, a tried and true methodology is to utilize precalculated results. In this context, a lookup table. Clearly, a lookup table of all combinations of 64 bits (264 combinations!) is unwieldy, but you can split the difference by precalculating each byte of the n-byte variable. For 8 bytes, this is 256*8, or 2KiB of memory. Consider:

int weightedBitCount(long word) {
    int result = 0;
    int[][] lookups = {
        {0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 4, 4, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10, 11, 11, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 14, 14, 15, 15, 6, 6, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 10, 10, 11, 11, 12, 12, 13, 13, 13, 13, 14, 14, 15, 15, 16, 16, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14, 15, 15, 16, 16, 17, 17, 15, 15, 16, 16, 17, 17, 18, 18, 18, 18, 19, 19, 20, 20, 21, 21, 7, 7, 8, 8, 9, 9, 10, 10, 10, 10, 11, 11, 12, 12, 13, 13, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14, 15, 15, 16, 16, 17, 17, 12, 12, 13, 13, 14, 14, 15, 15, 15, 15, 16, 16, 17, 17, 18, 18, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19, 20, 20, 21, 21, 22, 22, 13, 13, 14, 14, 15, 15, 16, 16, 16, 16, 17, 17, 18, 18, 19, 19, 17, 17, 18, 18, 19, 19, 20, 20, 20, 20, 21, 21, 22, 22, 23, 23, 18, 18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 23, 23, 24, 24, 22, 22, 23, 23, 24, 24, 25, 25, 25, 25, 26, 26, 27, 27, 28, 28},
        {0, 8, 9, 17, 10, 18, 19, 27, 11, 19, 20, 28, 21, 29, 30, 38, 12, 20, 21, 29, 22, 30, 31, 39, 23, 31, 32, 40, 33, 41, 42, 50, 13, 21, 22, 30, 23, 31, 32, 40, 24, 32, 33, 41, 34, 42, 43, 51, 25, 33, 34, 42, 35, 43, 44, 52, 36, 44, 45, 53, 46, 54, 55, 63, 14, 22, 23, 31, 24, 32, 33, 41, 25, 33, 34, 42, 35, 43, 44, 52, 26, 34, 35, 43, 36, 44, 45, 53, 37, 45, 46, 54, 47, 55, 56, 64, 27, 35, 36, 44, 37, 45, 46, 54, 38, 46, 47, 55, 48, 56, 57, 65, 39, 47, 48, 56, 49, 57, 58, 66, 50, 58, 59, 67, 60, 68, 69, 77, 15, 23, 24, 32, 25, 33, 34, 42, 26, 34, 35, 43, 36, 44, 45, 53, 27, 35, 36, 44, 37, 45, 46, 54, 38, 46, 47, 55, 48, 56, 57, 65, 28, 36, 37, 45, 38, 46, 47, 55, 39, 47, 48, 56, 49, 57, 58, 66, 40, 48, 49, 57, 50, 58, 59, 67, 51, 59, 60, 68, 61, 69, 70, 78, 29, 37, 38, 46, 39, 47, 48, 56, 40, 48, 49, 57, 50, 58, 59, 67, 41, 49, 50, 58, 51, 59, 60, 68, 52, 60, 61, 69, 62, 70, 71, 79, 42, 50, 51, 59, 52, 60, 61, 69, 53, 61, 62, 70, 63, 71, 72, 80, 54, 62, 63, 71, 64, 72, 73, 81, 65, 73, 74, 82, 75, 83, 84, 92},
        {0, 16, 17, 33, 18, 34, 35, 51, 19, 35, 36, 52, 37, 53, 54, 70, 20, 36, 37, 53, 38, 54, 55, 71, 39, 55, 56, 72, 57, 73, 74, 90, 21, 37, 38, 54, 39, 55, 56, 72, 40, 56, 57, 73, 58, 74, 75, 91, 41, 57, 58, 74, 59, 75, 76, 92, 60, 76, 77, 93, 78, 94, 95, 111, 22, 38, 39, 55, 40, 56, 57, 73, 41, 57, 58, 74, 59, 75, 76, 92, 42, 58, 59, 75, 60, 76, 77, 93, 61, 77, 78, 94, 79, 95, 96, 112, 43, 59, 60, 76, 61, 77, 78, 94, 62, 78, 79, 95, 80, 96, 97, 113, 63, 79, 80, 96, 81, 97, 98, 114, 82, 98, 99, 115, 100, 116, 117, 133, 23, 39, 40, 56, 41, 57, 58, 74, 42, 58, 59, 75, 60, 76, 77, 93, 43, 59, 60, 76, 61, 77, 78, 94, 62, 78, 79, 95, 80, 96, 97, 113, 44, 60, 61, 77, 62, 78, 79, 95, 63, 79, 80, 96, 81, 97, 98, 114, 64, 80, 81, 97, 82, 98, 99, 115, 83, 99, 100, 116, 101, 117, 118, 134, 45, 61, 62, 78, 63, 79, 80, 96, 64, 80, 81, 97, 82, 98, 99, 115, 65, 81, 82, 98, 83, 99, 100, 116, 84, 100, 101, 117, 102, 118, 119, 135, 66, 82, 83, 99, 84, 100, 101, 117, 85, 101, 102, 118, 103, 119, 120, 136, 86, 102, 103, 119, 104, 120, 121, 137, 105, 121, 122, 138, 123, 139, 140, 156},
        {0, 24, 25, 49, 26, 50, 51, 75, 27, 51, 52, 76, 53, 77, 78, 102, 28, 52, 53, 77, 54, 78, 79, 103, 55, 79, 80, 104, 81, 105, 106, 130, 29, 53, 54, 78, 55, 79, 80, 104, 56, 80, 81, 105, 82, 106, 107, 131, 57, 81, 82, 106, 83, 107, 108, 132, 84, 108, 109, 133, 110, 134, 135, 159, 30, 54, 55, 79, 56, 80, 81, 105, 57, 81, 82, 106, 83, 107, 108, 132, 58, 82, 83, 107, 84, 108, 109, 133, 85, 109, 110, 134, 111, 135, 136, 160, 59, 83, 84, 108, 85, 109, 110, 134, 86, 110, 111, 135, 112, 136, 137, 161, 87, 111, 112, 136, 113, 137, 138, 162, 114, 138, 139, 163, 140, 164, 165, 189, 31, 55, 56, 80, 57, 81, 82, 106, 58, 82, 83, 107, 84, 108, 109, 133, 59, 83, 84, 108, 85, 109, 110, 134, 86, 110, 111, 135, 112, 136, 137, 161, 60, 84, 85, 109, 86, 110, 111, 135, 87, 111, 112, 136, 113, 137, 138, 162, 88, 112, 113, 137, 114, 138, 139, 163, 115, 139, 140, 164, 141, 165, 166, 190, 61, 85, 86, 110, 87, 111, 112, 136, 88, 112, 113, 137, 114, 138, 139, 163, 89, 113, 114, 138, 115, 139, 140, 164, 116, 140, 141, 165, 142, 166, 167, 191, 90, 114, 115, 139, 116, 140, 141, 165, 117, 141, 142, 166, 143, 167, 168, 192, 118, 142, 143, 167, 144, 168, 169, 193, 145, 169, 170, 194, 171, 195, 196, 220},
        {0, 32, 33, 65, 34, 66, 67, 99, 35, 67, 68, 100, 69, 101, 102, 134, 36, 68, 69, 101, 70, 102, 103, 135, 71, 103, 104, 136, 105, 137, 138, 170, 37, 69, 70, 102, 71, 103, 104, 136, 72, 104, 105, 137, 106, 138, 139, 171, 73, 105, 106, 138, 107, 139, 140, 172, 108, 140, 141, 173, 142, 174, 175, 207, 38, 70, 71, 103, 72, 104, 105, 137, 73, 105, 106, 138, 107, 139, 140, 172, 74, 106, 107, 139, 108, 140, 141, 173, 109, 141, 142, 174, 143, 175, 176, 208, 75, 107, 108, 140, 109, 141, 142, 174, 110, 142, 143, 175, 144, 176, 177, 209, 111, 143, 144, 176, 145, 177, 178, 210, 146, 178, 179, 211, 180, 212, 213, 245, 39, 71, 72, 104, 73, 105, 106, 138, 74, 106, 107, 139, 108, 140, 141, 173, 75, 107, 108, 140, 109, 141, 142, 174, 110, 142, 143, 175, 144, 176, 177, 209, 76, 108, 109, 141, 110, 142, 143, 175, 111, 143, 144, 176, 145, 177, 178, 210, 112, 144, 145, 177, 146, 178, 179, 211, 147, 179, 180, 212, 181, 213, 214, 246, 77, 109, 110, 142, 111, 143, 144, 176, 112, 144, 145, 177, 146, 178, 179, 211, 113, 145, 146, 178, 147, 179, 180, 212, 148, 180, 181, 213, 182, 214, 215, 247, 114, 146, 147, 179, 148, 180, 181, 213, 149, 181, 182, 214, 183, 215, 216, 248, 150, 182, 183, 215, 184, 216, 217, 249, 185, 217, 218, 250, 219, 251, 252, 284},
        {0, 40, 41, 81, 42, 82, 83, 123, 43, 83, 84, 124, 85, 125, 126, 166, 44, 84, 85, 125, 86, 126, 127, 167, 87, 127, 128, 168, 129, 169, 170, 210, 45, 85, 86, 126, 87, 127, 128, 168, 88, 128, 129, 169, 130, 170, 171, 211, 89, 129, 130, 170, 131, 171, 172, 212, 132, 172, 173, 213, 174, 214, 215, 255, 46, 86, 87, 127, 88, 128, 129, 169, 89, 129, 130, 170, 131, 171, 172, 212, 90, 130, 131, 171, 132, 172, 173, 213, 133, 173, 174, 214, 175, 215, 216, 256, 91, 131, 132, 172, 133, 173, 174, 214, 134, 174, 175, 215, 176, 216, 217, 257, 135, 175, 176, 216, 177, 217, 218, 258, 178, 218, 219, 259, 220, 260, 261, 301, 47, 87, 88, 128, 89, 129, 130, 170, 90, 130, 131, 171, 132, 172, 173, 213, 91, 131, 132, 172, 133, 173, 174, 214, 134, 174, 175, 215, 176, 216, 217, 257, 92, 132, 133, 173, 134, 174, 175, 215, 135, 175, 176, 216, 177, 217, 218, 258, 136, 176, 177, 217, 178, 218, 219, 259, 179, 219, 220, 260, 221, 261, 262, 302, 93, 133, 134, 174, 135, 175, 176, 216, 136, 176, 177, 217, 178, 218, 219, 259, 137, 177, 178, 218, 179, 219, 220, 260, 180, 220, 221, 261, 222, 262, 263, 303, 138, 178, 179, 219, 180, 220, 221, 261, 181, 221, 222, 262, 223, 263, 264, 304, 182, 222, 223, 263, 224, 264, 265, 305, 225, 265, 266, 306, 267, 307, 308, 348},
        {0, 48, 49, 97, 50, 98, 99, 147, 51, 99, 100, 148, 101, 149, 150, 198, 52, 100, 101, 149, 102, 150, 151, 199, 103, 151, 152, 200, 153, 201, 202, 250, 53, 101, 102, 150, 103, 151, 152, 200, 104, 152, 153, 201, 154, 202, 203, 251, 105, 153, 154, 202, 155, 203, 204, 252, 156, 204, 205, 253, 206, 254, 255, 303, 54, 102, 103, 151, 104, 152, 153, 201, 105, 153, 154, 202, 155, 203, 204, 252, 106, 154, 155, 203, 156, 204, 205, 253, 157, 205, 206, 254, 207, 255, 256, 304, 107, 155, 156, 204, 157, 205, 206, 254, 158, 206, 207, 255, 208, 256, 257, 305, 159, 207, 208, 256, 209, 257, 258, 306, 210, 258, 259, 307, 260, 308, 309, 357, 55, 103, 104, 152, 105, 153, 154, 202, 106, 154, 155, 203, 156, 204, 205, 253, 107, 155, 156, 204, 157, 205, 206, 254, 158, 206, 207, 255, 208, 256, 257, 305, 108, 156, 157, 205, 158, 206, 207, 255, 159, 207, 208, 256, 209, 257, 258, 306, 160, 208, 209, 257, 210, 258, 259, 307, 211, 259, 260, 308, 261, 309, 310, 358, 109, 157, 158, 206, 159, 207, 208, 256, 160, 208, 209, 257, 210, 258, 259, 307, 161, 209, 210, 258, 211, 259, 260, 308, 212, 260, 261, 309, 262, 310, 311, 359, 162, 210, 211, 259, 212, 260, 261, 309, 213, 261, 262, 310, 263, 311, 312, 360, 214, 262, 263, 311, 264, 312, 313, 361, 265, 313, 314, 362, 315, 363, 364, 412},
        {0, 56, 57, 113, 58, 114, 115, 171, 59, 115, 116, 172, 117, 173, 174, 230, 60, 116, 117, 173, 118, 174, 175, 231, 119, 175, 176, 232, 177, 233, 234, 290, 61, 117, 118, 174, 119, 175, 176, 232, 120, 176, 177, 233, 178, 234, 235, 291, 121, 177, 178, 234, 179, 235, 236, 292, 180, 236, 237, 293, 238, 294, 295, 351, 62, 118, 119, 175, 120, 176, 177, 233, 121, 177, 178, 234, 179, 235, 236, 292, 122, 178, 179, 235, 180, 236, 237, 293, 181, 237, 238, 294, 239, 295, 296, 352, 123, 179, 180, 236, 181, 237, 238, 294, 182, 238, 239, 295, 240, 296, 297, 353, 183, 239, 240, 296, 241, 297, 298, 354, 242, 298, 299, 355, 300, 356, 357, 413, 63, 119, 120, 176, 121, 177, 178, 234, 122, 178, 179, 235, 180, 236, 237, 293, 123, 179, 180, 236, 181, 237, 238, 294, 182, 238, 239, 295, 240, 296, 297, 353, 124, 180, 181, 237, 182, 238, 239, 295, 183, 239, 240, 296, 241, 297, 298, 354, 184, 240, 241, 297, 242, 298, 299, 355, 243, 299, 300, 356, 301, 357, 358, 414, 125, 181, 182, 238, 183, 239, 240, 296, 184, 240, 241, 297, 242, 298, 299, 355, 185, 241, 242, 298, 243, 299, 300, 356, 244, 300, 301, 357, 302, 358, 359, 415, 186, 242, 243, 299, 244, 300, 301, 357, 245, 301, 302, 358, 303, 359, 360, 416, 246, 302, 303, 359, 304, 360, 361, 417, 305, 361, 362, 418, 363, 419, 420, 476}
    };
    for (int bite = 0; bite < 8; bite++)
        result += lookups[bite][ (int)(word >> (bite * 8)) & 0xff ];
    return result;
}

You might clean that up a bit, move the initialization out of the function, and so on, but this crucially removes the branch from your loop, and reduces your best case of 128 instructions (for all zeros) to 56 instructions (rough numbers). The worst case is a bit more pronounced at 192 instructions to 56. Further, a smart compiler might unroll the loop entirely, reducing to 40 instructions.

like image 67
hunteke Avatar answered Oct 11 '22 22:10

hunteke


I think is also important to have less hash collisions together with the hashing performance. Faster hashing calculation can make your program generally slower, because of big amount of hash misses.

It mind be a better idea to use some generic hash function like MurMur3A from Google Guava, instead of inventing your own.

There are many Benchmark about hashing, for example:

  • http://greenrobot.org/essentials/features/performant-hash-functions-for-java/comparison-of-hash-functions/
  • https://www.strchr.com/hash_functions
  • https://github.com/Cyan4973/xxHash

I think you can do some micro-benchmarking using a Google Caliper and check which hash function is better for you case.

BTW. Ask your self why do you need a custom BitSet ?

like image 27
Victor Gubin Avatar answered Oct 11 '22 23:10

Victor Gubin