Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to sort rows and cols by similarity

I fell across a spreadsheet that explains a method to sort both rows and columns of a matrix that contains binary data so that the number of changes between consecutive rows and cols is minimzed.

For example, starting with:

initial table

After 15 manual steps described in the tabs of the spreadsheed, the following table is obtained:

final result

I would like to know:

  1. what is the common name of this algorithm or method ?
  2. how to apply it to larger table (where 2^n would overflow...)
  3. how to generalize it to non binary data, for example using Levenshtein distance ?
  4. if there is any link to code (Excel VBA, Python, ...) already implementing this (otherwise I'll write it ... )

Thanks !

like image 759
Dr. Goulu Avatar asked Apr 11 '16 08:04

Dr. Goulu


1 Answers

You can represent each row by a vector L = [1, 1, 0, ... 1], and then define the distance between two lines d(L0, L1) by the number of elements at corresponding positions which are different between L0 and L1. This is known as the binary Hamming distance. If you had non-binary data, you would just extend your definition of distance and yes, Levenshtein distance would be an option.

Once you have distance well-defined, the rest of your problem is minimizing distance between consecutive rows. This is exactly the Traveling salesman problem, which is known to be NP-hard(http://www.diku.dk/hjemmesider/ansatte/jyrki/Paper/EKP85.pdf).

The direct solution (visiting all permutations) is O(n!), but you can do better easily by using dynamic programming, for example Held–Karp_algorithm. There are also approximate algorithms, such as the Nearest_neighbour_algorithm which quickly computes a non-optimal solution.

Finally, for implementations you can easily google "traveling salesman excel/python" and find many tutorials and examples.

like image 112
Giovanni Funchal Avatar answered Oct 20 '22 22:10

Giovanni Funchal