Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to measure similarity between two sequences of strings

Tags:

sequences

How can I measure similarity-percentage between two sequences of strings?

I have two text files and In files there sequences are written like

First file:

AAA BBB DDD CCC GGG MMM AAA MMM

Second file:

BBB DDD CCC MMM AAA MMM

How to measure similarity between these two files in terms of order of strings?

For example in above example both files have similarity due to order of strings is same however some strings are missing in file-2. What algorithm is best suitable to solve this problem so that I can measure how similar is order of strings not frequency of strings in two?

like image 414
Dheeraj Agarwal Avatar asked Jun 01 '12 05:06

Dheeraj Agarwal


People also ask

How do you find the similarity between two strings?

Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.

What is similarity algorithm?

Similarity algorithms compute the similarity of pairs of nodes based on their neighborhoods or their properties. Several similarity metrics can be used to compute a similarity score.

How do you find the similarity between two strings in python?

Use the == and != operators to compare two strings for equality. Use the is operator to check if two strings are the same instance. Use the < , > , <= , and >= operators to compare strings alphabetically.

Which method is used to measure the similarity?

The Sørensen–Dice distance is a statistical metric used to measure the similarity between sets of data. It is defined as two times the size of the intersection of P and Q, divided by the sum of elements in each data set P and Q.


1 Answers

You could use the Levenstein Distance algorithm. It analyzes how many edits that is needed to transform one string into another. This article explains it pretty well, and a sample implementation is provided.

Copy paste from Codeproject:

1.  Set n to be the length of s. ("GUMBO")
    Set m to be the length of t. ("GAMBOL")
    If n = 0, return m and exit.
    If m = 0, return n and exit.
    Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements.
2.  Initialize v0 to 0..m.
3.  Examine each character of s (i from 1 to n).
4.  Examine each character of t (j from 1 to m).
5.  If s[i] equals t[j], the cost is 0.
    If s[i] is not equal to t[j], the cost is 1.
6.  Set cell v1[j] equal to the minimum of:
    a. The cell immediately above plus 1: v1[j-1] + 1.
    b. The cell immediately to the left plus 1: v0[j] + 1.
    c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
7.  After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m].
like image 154
alexn Avatar answered Nov 30 '22 17:11

alexn