Sorting strings so that hamming distance is low between adjacent strings

People also ask

How do you calculate Hamming distance between strings?

Example: Lets say i have 2 strings of equal length "ABCD" and "ACCC". In this example first and second last are same in both string so you need to substitute 2 character in 2nd to make it same hence, hamming distance is 2 in this case. so total hamming distance is 1 + 1 = 2.

What is the Hamming distance between the data?

A Hamming distance in information technology represents the number of points at which two corresponding pieces of data can be different. It is often used in various kinds of error correction or evaluation of contrasting strings or pieces of data.

How do you find the minimum Hamming distance?

In order to calculate the Hamming distance between two strings, and , we perform their XOR operation, (a⊕ b), and then count the total number of 1s in the resultant string.

Problem:

I have N (~100k-1m) strings each D (e.g. 2000) characters long and with a low alphabet (eg 3 possible characters). I would like to sort these strings such that there are as few possible changes between adjacent strings (eg hamming distance is low). Solution doesn't have to be the best possible but closer the better.

Example

N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba

//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 4)
2. bacba (Hamming distance 4->2 = 2)

Thoughts about the problem

I have a bad feeling this is a non trivial problem. If we think of each string as a node and the distances to other strings as an edge, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, I think turning the problem into some more like the Canadian Traveller Problem.

At the moment my solution has been to use a VP tree to find a greedy nearest neighbour type solution to the problem

curr_string = a randomly chosen string from full set
while(tree not empty)
    found_string = find nearest string in tree
    tree.remove(found_string)
    sorted_list.add(curr_string)
    curr_string = found_string

but initial results appear to be poor. Hashing strings so that more similar ones are closer may be another option but I know little about how good a solution this will provide or how well it will scale to data of this size.

Related questions
                            
                                Managing Core Data iCloud Transaction Logs
                            
                                Can I use In-App Purchase on iOS pirated application to ask for the user to pay for it?
                            
                                Build a JSON tree from materialized paths
                            
                                How to monitor free memory ( including buffers and cache) in java?
                            
                                Does anybody know any very basic stm32 tutorials? [closed]
                            
                                Did I just write a continuation?
                            
                                Looking for ideas debugging a tricky windows service startup gremlin
                            
                                Is it correct for Java system properties to be used to set and get arbitrary program parameters?
                            
                                Why is this double-checked lock implemented with a separate wrapper class?
                            
                                How to add actions to the top part of a split ActionBar
                            
                                how to add file attachment in PHPMailer?
                            
                                How to add a custom annotation to Spring MVC?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Sorting strings so that hamming distance is low between adjacent strings

Tags:

People also ask

Recent Activity

Donate For Us