What is the fastest way to XOR A LOT of binary arrays in python?

I am tasked with calculating Hamming distances between 1D binary arrays in two groups: a group of 3000 arrays and a group of 10000 arrays, where every array is 100 items (bits) long. That's 3000 x 10000 Hamming distance (HD) calculations on 100-bit objects, and it all has to be done in at most a dozen minutes.

Here's the best of what I came up with:

#X - 3000 by 100 bool np.array 
#Y - 10000 by 100 bool np.array
hd = []
i=1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y)) 
    C = Y ^ arr # just xor this array by all the arrays in the other group simultaneously
    hd.append([sum(c) for c in C]) #add up all the bits to get the hamming distance
    i+=1

return np.array(hd)

And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?

asked May 31 '18 by AlanKalane

3 Answers

You should be able to dramatically improve the summing speed by using numpy to perform the sum, rather than a list comprehension with the built-in sum function, which takes no advantage of numpy's vectorized operations.

Just replace:

hd.append([sum(c) for c in C])

with:

# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))

which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis 1). For example:

>>> arr = np.array([[True, False, True], [False, False, True], [True, True, True]], dtype=bool)
>>> arr.sum(1, np.uint16)
array([2, 1, 3], dtype=uint16)

This performs all the work at the C layer in a single operation, with no type conversions. Your original approach needs a Python-level loop over the rows, and then the built-in sum, even though it iterates at the C layer, still has to convert each numpy value one by one from a numpy bool to a Python-level int just to add them up. For the array scales you're describing, the vectorized sum should run substantially faster.
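
If you want to measure the difference yourself, here is a minimal timing sketch (the shapes and the timeit setup are my own assumptions, picked to match one iteration of the question's loop):

import timeit

import numpy as np

# One XOR result block: 10000 rows of 100 bits, as in the question's inner step
C = np.random.rand(10000, 100) > 0.5

# Python-level loop: built-in sum converts each numpy bool to a Python int
t_builtin = timeit.timeit(lambda: [sum(c) for c in C], number=10)

# Single vectorized reduction at the C layer
t_numpy = timeit.timeit(lambda: C.sum(1, dtype=np.uint16), number=10)

print(f"built-in sum: {t_builtin:.3f} s, numpy row sum: {t_numpy:.3f} s")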

Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:

i=1
for x in X:
    ... rest of loop ...
    i+=1

with:

for i, x in enumerate(X, 1):
    ... rest of loop ...

and you'll get the same behavior, but slightly faster, more concise, and generally cleaner.
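
Putting the two suggestions together, the loop from the question would look roughly like this (a sketch; the function wrapper around it is mine):

import numpy as np

def hamming_distances(X, Y):
    # X: 3000 x 100 bool array, Y: 10000 x 100 bool array (per the question)
    hd = []
    for i, x in enumerate(X, 1):
        print("object nr " + str(i) + "/" + str(len(X)))
        arr = np.array([x] * len(Y))  # tile x so it lines up with every row of Y
        C = Y ^ arr                   # XOR against the whole other group at once
        hd.append(C.sum(1, dtype=np.uint16))  # vectorized row sums
    return np.array(hd)

As an aside (my own note, not part of the answer), numpy broadcasting would let you write C = Y ^ x and skip building arr entirely.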

answered by ShadowRanger


IIUC, you can use np.logical_xor with a list comprehension:

result = np.array([[np.logical_xor(X[a], Y[b].T).sum() for b in range(len(Y))] 
                                                       for a in range(len(X))])

The whole operation runs in about 7 seconds on my machine:

0:00:07.226645
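
For a toy-sized sanity check of what that comprehension produces (the small shapes below are my own, chosen so the output is easy to verify by hand):

import numpy as np

X = np.array([[True, False, True],
              [False, False, False]])          # 2 arrays of 3 bits
Y = np.array([[True, True, True],
              [True, False, True],
              [False, False, False]])          # 3 arrays of 3 bits

result = np.array([[np.logical_xor(X[a], Y[b].T).sum() for b in range(len(Y))]
                                                        for a in range(len(X))])
print(result.shape)  # (2, 3): one Hamming distance per (X row, Y row) pair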
answered by rafaelc


Just in case you are not limited to Python, here is a solution in C++ using std::bitset:

#include <iostream>
#include <bitset>
#include <vector>
#include <random>
#include <chrono>

using real = double;

std::mt19937_64 rng;
std::uniform_real_distribution<real> bitset_dist(0, 1);
real prob(0.75);

std::bitset<100> rand_bitset()
{
    std::bitset<100> bitset;
    for (size_t idx = 0; idx < bitset.size(); ++idx)
    {
         bitset[idx] = bitset_dist(rng) < prob;
    }
    return bitset; // returning by value lets NRVO/move handle it
}

int main()
{
    rng.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

    size_t v1_size(3000);
    size_t v2_size(10000);

    std::vector<size_t> hd;
    std::vector<std::bitset<100>> vec1;
    std::vector<std::bitset<100>> vec2;
    vec1.reserve(v1_size);
    vec2.reserve(v2_size);
    hd.reserve(v1_size * v2_size); /// Edited from hd.reserve(v1_size);

    for (size_t i = 0; i < v1_size; ++i)
    {
        vec1.emplace_back(rand_bitset());
    }
    for (size_t i = 0; i < v2_size; ++i)
    {
        vec2.emplace_back(rand_bitset());
    }

    std::cout << "vec1 size: " << vec1.size() << '\n'
              << "vec2 size: " << vec2.size() << '\n';

    auto start(std::chrono::high_resolution_clock::now());
    for (const auto& b1 : vec1)
    {
        for (const auto& b2 : vec2)
        {
            /// Count only the bits that are set and store them
            hd.emplace_back((b1 ^ b2).count());
        }
    }
    auto time(std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count());

    std::cout << vec1.size() << " x " << vec2.size()
              << " xor operations on 100 bits took " << time << " ms\n";
    return 0;
}

On my machine, the whole operation (3000 x 10000) takes about 300 ms.

You could put this into a function, compile it into a shared library, and call it from Python (sketched below). Another option is to store the distances in a file and then read them back in Python.
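
For the "call it from Python" route, a rough sketch of the Python side could look like the following. Everything here is an assumption: it presumes the nested loop above has been wrapped in an extern "C" function named compute_hamming(const unsigned char* x, const unsigned char* y, int nx, int ny, int nbits, unsigned short* out) and compiled into libhamming.so; the actual names and signature depend on how you wrap it.

import ctypes

import numpy as np

# Hypothetical library and function name -- adjust to whatever you actually build
lib = ctypes.CDLL("./libhamming.so")
lib.compute_hamming.argtypes = [
    ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_uint8),
    ctypes.c_int, ctypes.c_int, ctypes.c_int,
    ctypes.POINTER(ctypes.c_uint16),
]
lib.compute_hamming.restype = None

def hamming_via_cpp(X, Y):
    # Pack the bool arrays as contiguous bytes and let the C++ side do the work
    Xb = np.ascontiguousarray(X, dtype=np.uint8)
    Yb = np.ascontiguousarray(Y, dtype=np.uint8)
    out = np.empty((len(Xb), len(Yb)), dtype=np.uint16)
    lib.compute_hamming(
        Xb.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
        Yb.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
        len(Xb), len(Yb), Xb.shape[1],
        out.ctypes.data_as(ctypes.POINTER(ctypes.c_uint16)),
    )
    return out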


EDIT: I had the wrong size for the hd vector. Reserving the proper amount of memory reduces the operation to about 190 ms because reallocations are avoided.

answered by cantordust