I'm trying to generate n
binary vectors of some arbitrary length l
, where each vector i
has a Hamming distance of d
(where d
is even) from every other vector j
. I'm not sure if there are any theoretical relationships between n
, l
, and d
, but I'm wondering if there are any implementations for this task. My current implementation is shown below. Sometimes I am successful, other times the code hangs, which indicates either a) it's not possible to find n
such vectors given l
and d
, or b) the search takes a long time especially for large values of l
.
My questions are:
What kind of theoretical relationships exist between n
, l
, and d
?
import numpy as np
def get_bin(n):
return ''.join([str(np.random.randint(0, 2)) for _ in range(n)])
def hamming(s1, s2):
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
def generate_codebook(n, num_codes, d):
codebooks = []
seen = []
while len(codebooks) < num_codes:
code = get_bin(n)
if code in seen:
continue
else:
if len(codebooks) == 0:
codebooks.append(code)
print len(codebooks), code
else:
if all(map(lambda x: int(hamming(code, x)) == d, codebooks)):
codebooks.append(code)
print len(codebooks), code
seen.append(code)
codebook_vectorized = map(lambda x: map(lambda b: int(b), x), codebooks)
return np.array(codebook_vectorized)
Example:
codebook = generate_codebook(4,3,2)
codebook
1 1111
2 1001
3 0101
To calculate the Hamming distance, you simply count the number of bits where two same-length messages differ. An example of Hamming distance 1 is the distance between 1101 and 1001 . If you increase the distance to 2 , we can give as an example 1001 and 1010 .
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance, d(11011001, 10011101) = 2.
The Hamming distance involves counting up which set of corresponding digits or places are different, and which are the same. For example, take the text string “hello world” and contrast it with another text string, “herra poald.” There are five places along the corresponding strings where the letters are different.
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
Let's build a graph G
where every L
-bit binary vector v
is a vertex. And there is an edge (vi, vj)
only when a Hamming distance between vi
and vj
is equal to d
. Now we need to find a clique of size n
is this graph.
Clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent.
The task of finding a clique of given size in an arbitrary graph is NP-complete. You can read about this problem and some algorithms in this wikipedia article.
There are many special cases of this problem. For example, for perfect graphs there is a polynomial algorithm. Don't know if it is possible to show that our graph is one of these special cases.
Not a real solution, but more of a partial discussion about the relationship between l
, d
and n
and the process of generating vectors. In any case, you may consider posting the question (or a similar one, in more formal terms) to Mathematics Stack Exchange. I have been reasoning as I was writing, but I hope I didn't make a mistake.
Let's say we have l = 6
. Since the Hamming distance depends only on position-wise differences, you can decide to start by putting one first arbitrary vector in your set (if there are solutions, some may not include it, but at least one should). So let's begin with an initial v1 = 000000
. Now, if d = 1
then obviously n
can only be 1 or 2 (with 111111
). If d = 1
, you will find that n
can also only be 1 or 2; for example, you could add 000001
, but any other possible vector will have a distance of 2 or more with at least one the vectors you have.
Let's say d = 4
. You need to change 4 positions and keep the other 2, so you have 4-combinations from a 6-element set, which is 15 choices, 001111
, 010111
, etc. - you can see now that the binomial coefficient C(n, d)
plus 1 is an upper bound for n
. Let's pick v2 = 001111
, and say that the kept positions are T = [1, 2]
and the changed ones are S = [3, 4, 5, 6]
. Now to go on, we could consider making changes to v2
; however, in order to keep the right distances we must follow these rules:
v2
.S
, we must make another change in a position in T
(and viceversa). Otherwise, the distance to v1
would not be kept.Logically, if d
were odd you would be done now (only sets of two elements could be formed), but fortunately you already said that your distance numbers are even. So we divide our number by two, which is 2, and need to pick 2 elements from S
, C(4, 2) = 6
, and 2 elements from T
, C(2, 2) = 1
, giving us 6 * 1 = 6
options - you should note now that C(d, d/2) * C(l - d, d/2) + 2
is a new, lower upper bound for n
, if d
is even. Let's pick v3 = 111100
. v3
has now four kinds of positions: positions that have changed with respect to both v1
and v2
, P1 = [1, 2]
, positions that have not changed with respect to either v1
or v2
, P2 = []
(none in this case), positions that have changed with respect to v1
but not with respect to v2
, P3 = [3, 4]
, and positions that have changed with respect to v2
but not with respect to v1
, P4 = [5, 6]
. Same deal, we need 4 changes, but now each change we make to a P1
position must imply a change in a P2
position, and each change we make to a P3
position must imply a change in a P4
position. The only remaining option is v4 = 110011
, and that would be it, the maximum n
would be 4.
So, thinking about the problem from a combinatoric point of view, after each change you will have an exponentially increasing number of "types of positions" (2 after the first change, 4 after the second, 8, 16...) defined in terms of whether they are equal or not in each of the previously added vectors, and these can be arranged in couples through a "symmetry" or "complement" relationship. On each step, you can (I think, and this is the part of this reasoning that I am less sure about) greedily choose a set of changes from these couples and compute the sizes of the "types of positions" for the next step. If this is all correct, you should be able to write an algorithm based on this to generate and/or count the possible sets of vectors for some particular l
and d
and n
if given.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With