Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating n binary vectors where each vector has a Hamming distance of d from every other vector

I'm trying to generate n binary vectors of some arbitrary length l, where each vector i has a Hamming distance of d (where d is even) from every other vector j. I'm not sure if there are any theoretical relationships between n, l, and d, but I'm wondering if there are any implementations for this task. My current implementation is shown below. Sometimes I am successful, other times the code hangs, which indicates either a) it's not possible to find n such vectors given l and d, or b) the search takes a long time especially for large values of l.

My questions are:

  • Are there any efficient implementations of this task?
  • What kind of theoretical relationships exist between n, l, and d?

    import numpy as np
    
    def get_bin(n):
        return ''.join([str(np.random.randint(0, 2)) for _ in range(n)])
    
    def hamming(s1, s2):
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))
    
    def generate_codebook(n, num_codes, d):    
        codebooks = []
        seen = []
    
        while len(codebooks) < num_codes:
            code = get_bin(n)
            if code in seen:
                continue
            else:
                if len(codebooks) == 0:
                    codebooks.append(code)
                    print len(codebooks), code
                else:
                    if all(map(lambda x: int(hamming(code, x)) == d, codebooks)):
                        codebooks.append(code)
                        print len(codebooks), code
                seen.append(code)
    
        codebook_vectorized = map(lambda x: map(lambda b: int(b), x), codebooks)
        return np.array(codebook_vectorized)
    

Example:

codebook = generate_codebook(4,3,2)
codebook

1 1111
2 1001
3 0101
like image 801
Sean Saito Avatar asked Jan 17 '18 09:01

Sean Saito


People also ask

What is Hamming distance formula?

To calculate the Hamming distance, you simply count the number of bits where two same-length messages differ. An example of Hamming distance 1 is the distance between 1101 and 1001 . If you increase the distance to 2 , we can give as an example 1001 and 1010 .

What is the Hamming distance between 1101 1010?

11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance, d(11011001, 10011101) = 2.

What is Hamming distance explain with suitable example?

The Hamming distance involves counting up which set of corresponding digits or places are different, and which are the same. For example, take the text string “hello world” and contrast it with another text string, “herra poald.” There are five places along the corresponding strings where the letters are different.

What is the Hamming distance between the data?

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.


2 Answers

Let's build a graph G where every L-bit binary vector v is a vertex. And there is an edge (vi, vj) only when a Hamming distance between vi and vj is equal to d. Now we need to find a clique of size n is this graph.

Clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent.

The task of finding a clique of given size in an arbitrary graph is NP-complete. You can read about this problem and some algorithms in this wikipedia article.

There are many special cases of this problem. For example, for perfect graphs there is a polynomial algorithm. Don't know if it is possible to show that our graph is one of these special cases.

like image 67
DAle Avatar answered Sep 19 '22 23:09

DAle


Not a real solution, but more of a partial discussion about the relationship between l, d and n and the process of generating vectors. In any case, you may consider posting the question (or a similar one, in more formal terms) to Mathematics Stack Exchange. I have been reasoning as I was writing, but I hope I didn't make a mistake.

Let's say we have l = 6. Since the Hamming distance depends only on position-wise differences, you can decide to start by putting one first arbitrary vector in your set (if there are solutions, some may not include it, but at least one should). So let's begin with an initial v1 = 000000. Now, if d = 1 then obviously n can only be 1 or 2 (with 111111). If d = 1, you will find that n can also only be 1 or 2; for example, you could add 000001, but any other possible vector will have a distance of 2 or more with at least one the vectors you have.

Let's say d = 4. You need to change 4 positions and keep the other 2, so you have 4-combinations from a 6-element set, which is 15 choices, 001111, 010111, etc. - you can see now that the binomial coefficient C(n, d) plus 1 is an upper bound for n. Let's pick v2 = 001111, and say that the kept positions are T = [1, 2] and the changed ones are S = [3, 4, 5, 6]. Now to go on, we could consider making changes to v2; however, in order to keep the right distances we must follow these rules:

  • We must make 4 changes to v2.
  • If we change a position in a position in S, we must make another change in a position in T (and viceversa). Otherwise, the distance to v1 would not be kept.

Logically, if d were odd you would be done now (only sets of two elements could be formed), but fortunately you already said that your distance numbers are even. So we divide our number by two, which is 2, and need to pick 2 elements from S, C(4, 2) = 6, and 2 elements from T, C(2, 2) = 1, giving us 6 * 1 = 6 options - you should note now that C(d, d/2) * C(l - d, d/2) + 2 is a new, lower upper bound for n, if d is even. Let's pick v3 = 111100. v3 has now four kinds of positions: positions that have changed with respect to both v1 and v2, P1 = [1, 2], positions that have not changed with respect to either v1 or v2, P2 = [] (none in this case), positions that have changed with respect to v1 but not with respect to v2, P3 = [3, 4], and positions that have changed with respect to v2 but not with respect to v1, P4 = [5, 6]. Same deal, we need 4 changes, but now each change we make to a P1 position must imply a change in a P2 position, and each change we make to a P3 position must imply a change in a P4 position. The only remaining option is v4 = 110011, and that would be it, the maximum n would be 4.

So, thinking about the problem from a combinatoric point of view, after each change you will have an exponentially increasing number of "types of positions" (2 after the first change, 4 after the second, 8, 16...) defined in terms of whether they are equal or not in each of the previously added vectors, and these can be arranged in couples through a "symmetry" or "complement" relationship. On each step, you can (I think, and this is the part of this reasoning that I am less sure about) greedily choose a set of changes from these couples and compute the sizes of the "types of positions" for the next step. If this is all correct, you should be able to write an algorithm based on this to generate and/or count the possible sets of vectors for some particular l and d and n if given.

like image 31
jdehesa Avatar answered Sep 21 '22 23:09

jdehesa