
Efficiently determine "how sorted" a list is, e.g. Levenshtein distance

I'm doing some research on ranking algorithms and would like to, given a sorted list and some permutation of that list, calculate some distance between the two permutations. For the Levenshtein distance, this corresponds to calculating the distance between a sequence and a sorted copy of that sequence. There is also, for instance, the "inversion distance", for which a linear-time algorithm is detailed here, and which I am working on implementing.

Does anyone know of an existing Python implementation of the inversion distance, and/or an optimization of the Levenshtein distance? I'm calculating this on sequences of around 50,000 to 200,000 elements, so O(n^2) is far too slow, but O(n log(n)) or better should be sufficient.

Other metrics for permutation similarity would also be appreciated.


Edit for people from the future:

Based on Raymond Hettinger's response; it's not Levenshtein or inversion distance, but rather "gestalt pattern matching" :P

from difflib import SequenceMatcher
import random
ratings = [random.gauss(1200, 200) for i in range(100000)]
SequenceMatcher(None, ratings, sorted(ratings)).ratio()

runs in ~6 seconds on a terrible desktop.

Edit2: If you can coerce your sequence into a permutation of [0 .. n-1] (as the code below assumes; for [1 .. n], use enumerate(l, 1)), then a variation of the Manhattan metric is extremely fast and has some interesting results.

manhattan = lambda l: sum(abs(a - i) for i, a in enumerate(l)) / (0.5 * len(l) ** 2)
rankings = list(range(100000))
random.shuffle(rankings)
manhattan(rankings) # ~ 0.6665, < 1 second

The normalization factor is the largest possible sum, which is reached by a fully reversed list. It is exact for even-sized lists; for odd-sized lists it should be (0.5 * (len(l) ** 2 - 1)), so the formula above is a slight approximation there.
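A minimal sketch of the same metric with a normalization that is exact for both even and odd lengths. It assumes the input is a permutation of 0 .. n-1, as in the one-liner above; the function name manhattan_disorder is only illustrative.

```python
def manhattan_disorder(perm):
    """Normalized sum of |value - index| for a permutation of 0 .. n-1.

    Returns 0.0 for a sorted list and 1.0 for a fully reversed one.
    """
    n = len(perm)
    if n < 2:
        return 0.0  # nothing can be out of place
    total = sum(abs(a - i) for i, a in enumerate(perm))
    # The largest possible total is n*n/2 for even n and (n*n - 1)/2 for odd n,
    # which integer division expresses in one step.
    return total / (n * n // 2)
```

Usage is the same as before, e.g. manhattan_disorder(rankings) on the shuffled list from the example above.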

Edit3: There are several other measures of list similarity, such as the Kendall tau rank correlation coefficient and the Spearman rank correlation coefficient. Implementations are available in the SciPy library as scipy.stats.kendalltau and scipy.stats.spearmanr; each returns the correlation coefficient along with its associated p-value.
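The Kendall tau distance between a sequence and its sorted copy is just the number of inversions, so the inversion distance asked about above can be sketched in pure Python with a merge sort, which runs in O(n log n). This is an illustrative sketch rather than a tuned implementation; the names count_inversions and inversion_disorder are made up for this example.

```python
def count_inversions(seq):
    """Count pairs (i, j) with i < j and seq[i] > seq[j] via merge sort."""
    items = list(seq)      # work on a copy so the caller's data is untouched
    buffer = items[:]      # scratch space shared by every merge step

    def sort(lo, hi):
        # Sort items[lo:hi] in place and return the inversions it contained.
        if hi - lo < 2:
            return 0
        mid = (lo + hi) // 2
        count = sort(lo, mid) + sort(mid, hi)
        i, j, k = lo, mid, lo
        while i < mid and j < hi:
            if items[j] < items[i]:
                # items[j] precedes every element still left in the left half,
                # so it forms an inversion with each of them.
                count += mid - i
                buffer[k] = items[j]
                j += 1
            else:
                buffer[k] = items[i]
                i += 1
            k += 1
        # Copy whichever half still has elements left.
        while i < mid:
            buffer[k] = items[i]
            i += 1
            k += 1
        while j < hi:
            buffer[k] = items[j]
            j += 1
            k += 1
        items[lo:hi] = buffer[lo:hi]
        return count

    return sort(0, len(items))


def inversion_disorder(seq):
    """Inversions divided by the maximum n*(n-1)/2: 0.0 sorted, 1.0 reversed."""
    n = len(seq)
    if n < 2:
        return 0.0
    return count_inversions(seq) / (n * (n - 1) / 2)
```

Equal elements are not counted as inversions here, and the recursion depth grows only logarithmically, so lists of a few hundred thousand elements stay well within Python's default recursion limit.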

stefan asked Nov 21 '11 02:11





1 Answer

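The standard algorithm for Levenshtein distance is O(n**2), so if you want to go faster, use the alternative fast algorithm in the difflib module, SequenceMatcher. Its ratio() method computes a measure of similarity between two sequences as a float between 0 and 1, where 1 means the sequences are identical.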

If you have to stick with Levenshtein, there is a Python recipe for it on the ASPN Python Cookbook: http://code.activestate.com/recipes/576874-levenshtein-distance/ .

Another Python script can be found at: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
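For reference, here is a minimal sketch of the usual dynamic-programming formulation, keeping only two rows of the table at a time. It is not taken from either of the linked recipes, and it still takes time proportional to len(a) * len(b), which is why it is impractical for sequences of 100,000 elements.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings, lists, ...)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))  # distances from an empty prefix of a
    for i, x in enumerate(a, 1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y, free if equal
            ))
        previous = current
    return previous[-1]
```

It works on any pair of sequences with comparable elements, so levenshtein(ratings, sorted(ratings)) is valid, just slow for large inputs.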

Raymond Hettinger answered Oct 11 '22 12:10
