Python - How to generate the Pairwise Hamming Distance Matrix

Tags: python, vectorization, numpy, hamming-distance

I'm a beginner with Python. I'm having trouble computing the pairwise Hamming distance matrix between the rows of a binary input matrix using only the numpy library. I'm supposed to avoid loops and use vectorization. For instance, if I have something like:

   [ 1,  0,  0,  1,  1,  0]
   [ 1,  0,  0,  0,  0,  0]
   [ 1,  1,  1,  1,  0,  0]

The matrix should be something like:

   [ 0,  2,  3]
   [ 2,  0,  3]
   [ 3,  3,  0]

That is, if the original matrix is A and the Hamming distance matrix is B, then B[0,1] = hammingdistance(A[0], A[1]). In this case the answer is 2, since the two rows differ in exactly two positions.

So far my code is something like this:

def compute_HammingDistance(X):

     hammingDistanceMatrix = np.zeros(shape = (len(X), len(X)))
     hammingDistanceMatrix = np.count_nonzero ((X[:,:,None] != X[:,:,None].T))
     return hammingDistanceMatrix

However, it seems to return a scalar value instead of the intended matrix. I know I'm probably doing something wrong with the array/vector broadcasting, but I can't figure out how to fix it. I've tried using np.sum instead of np.count_nonzero, but both gave me much the same result.

asked Mar 12 '17 20:03

user2444400

2 Answers

Try this approach: insert a new axis at position 1, let broadcasting compare every pair of rows, and count the Trues (nonzero entries) with sum:

(arr[:, None, :] != arr).sum(2)

# array([[0, 2, 3],
#        [2, 0, 3],
#        [3, 3, 0]])

def compute_HammingDistance(X):
    return (X[:, None, :] != X).sum(2)
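
As a side note on why the version in the question returned a scalar, here is a minimal sketch. It assumes the same 3x6 example input; the names `diff`, `total` and `per_pair` are only illustrative. The comparison in the question does build a pairwise array, but `np.count_nonzero` without an `axis` argument counts over the whole array at once.

```python
import numpy as np

X = np.array([[1, 0, 0, 1, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0]])

# Shapes (3, 6, 1) and (1, 6, 3) broadcast to (3, 6, 3):
# diff[i, k, j] is True where X[i, k] != X[j, k].
diff = X[:, :, None] != X[:, :, None].T

# Without an axis, every True in the whole array is counted: one scalar.
total = np.count_nonzero(diff)

# Counting along the column axis (axis 1 in this layout) keeps one count per row pair.
per_pair = np.count_nonzero(diff, axis=1)

print(total)      # sum of all pairwise distances
print(per_pair)   # the 3x3 Hamming distance matrix
```

So the original broadcasting was close to working; under this reading, the missing piece was only the axis to reduce over.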

Explanation:

1) Create a 3d array with shape (3, 1, 6):

arr[:, None, :]
#array([[[1, 0, 0, 1, 1, 0]],
#       [[1, 0, 0, 0, 0, 0]],
#       [[1, 1, 1, 1, 0, 0]]])

2) This is the original 2d array, with shape (3, 6):

arr   
#array([[1, 0, 0, 1, 1, 0],
#       [1, 0, 0, 0, 0, 0],
#       [1, 1, 1, 1, 0, 0]])

3) Comparing the two triggers broadcasting because their shapes don't match. The 2d array arr is treated as shape (1, 3, 6) and repeated along axis 0 of the 3d array arr[:, None, :], while each (1, 6) slice of arr[:, None, :] is repeated along axis 1 to match (3, 6). Together, the two broadcasting steps compare every row of the original array with every row, giving a result of shape (3, 3, 6):

arr[:, None, :] != arr 
#array([[[False, False, False, False, False, False],
#        [False, False, False,  True,  True, False],
#        [False,  True,  True, False,  True, False]],
#       [[False, False, False,  True,  True, False],
#        [False, False, False, False, False, False],
#        [False,  True,  True,  True, False, False]],
#       [[False,  True,  True, False,  True, False],
#        [False,  True,  True,  True, False, False],
#        [False, False, False, False, False, False]]], dtype=bool)

4) The sum along the third axis (axis 2) counts how many elements are not equal, i.e. the Trues, which gives the Hamming distance.
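
The four steps above can be traced by printing the shape at each stage. This is a small sketch on the same example array; the intermediate names `left`, `mismatch` and `dist` are introduced here only for illustration.

```python
import numpy as np

arr = np.array([[1, 0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0]])

left = arr[:, None, :]        # step 1: shape (3, 1, 6)
mismatch = left != arr        # step 3: (3, 1, 6) vs (3, 6) -> (3, 3, 6)
dist = mismatch.sum(axis=2)   # step 4: collapse the column axis -> (3, 3)

print(left.shape, arr.shape, mismatch.shape, dist.shape)
print(dist)
```

Note that the intermediate boolean array holds n * n * m elements for an n x m input, so its memory use grows quickly with the number of rows.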

answered Oct 22 '22 10:10

Psidom

For reasons I do not understand, this

(2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2)

appears to be much faster than @Psidom's for larger arrays:

import numpy as np
from timeit import timeit

a = np.random.randint(0,2,(100,1000))
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 2.297890231013298
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.10616962902713567
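
To see why the inner-product expression yields Hamming distances at all: for 0/1 entries, each matching position contributes -0.25 to np.inner(a-0.5, 0.5-a) and each mismatching position contributes +0.25. With d mismatches out of n columns, the whole expression is 2 * 0.25 * (d - (n - d)) + n / 2 = d. The sketch below checks this numerically against the broadcasting version; the array size and the seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily for reproducibility
a = rng.integers(0, 2, size=(20, 50))   # random 0/1 matrix; sizes are illustrative

# Reference result: the broadcasting comparison from the first answer.
expected = (a[:, None, :] != a).sum(2)

# Inner-product form: 2 * 0.25 * (d - (n - d)) + n / 2 == d.
n = a.shape[1]
via_inner = 2 * np.inner(a - 0.5, 0.5 - a) + n / 2

# The values agree, although via_inner is a float array and expected holds ints.
print(np.array_equal(via_inner, expected))
```

One practical difference is that the inner-product form returns floats, so a cast may be wanted if integer distances are needed downstream.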

Psidom's is a bit faster for the very small example:

a
# array([[1, 0, 0, 1, 1, 0],
#        [1, 0, 0, 0, 0, 0],
#        [1, 1, 1, 1, 0, 0]])

timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 0.0004370050155557692
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.00068191799800843

Update

Part of the reason appears to be that float arrays are faster here than integer ones:

timeit(lambda: (0.5 * np.inner(2*a-1, 1-2*a) + a.shape[1] / 2), number=100)
# 0.7315902590053156
timeit(lambda: (0.5 * np.inner(2.0*a-1, 1-2.0*a) + a.shape[1] / 2), number=100)
# 0.12021801102673635
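
One plausible contributor (an assumption, not something measured in this answer) is that NumPy can hand floating-point matrix products to an optimized BLAS routine, while integer inputs take a slower generic path. If that holds, converting to floats explicitly and casting the result back keeps both the speed and integer output. A minimal sketch follows; `hamming_matrix_float` is a hypothetical helper name, and the input is assumed to contain only 0s and 1s.

```python
import numpy as np

def hamming_matrix_float(X):
    """Pairwise Hamming distances between rows of a 0/1 array (hypothetical helper)."""
    X = np.asarray(X)
    s = 2.0 * X - 1.0    # map 0/1 to -1.0/+1.0 as floats
    n = X.shape[1]
    # inner(s, -s)[i, j] = mismatches - matches = 2*d - n, so d = (n + inner) / 2
    d = 0.5 * (n + np.inner(s, -s))
    # The values are whole numbers; round defensively and cast back to ints.
    return np.rint(d).astype(np.intp)

example = [[1, 0, 0, 1, 1, 0],
           [1, 0, 0, 0, 0, 0],
           [1, 1, 1, 1, 0, 0]]
print(hamming_matrix_float(example))
```

Whether this is actually faster on a given machine depends on the NumPy build and array sizes, so it is worth timing on real data.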

answered Oct 22 '22 11:10

Paul Panzer
