I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.
Is there a name for the algorithm for doing this task efficiently? Is there a name for this approach?
Thanks.
Assume all items have a number ItemNumber
Simple solution -- always have the 2nd element's ItemNumber greater than the first item.
eg
for (firstitem = 1 to maxitemnumber)
for (seconditem = firstitemnumber+1 to maxitemnumber)
compare(firstitem, seconditem)
visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
I don't think its complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the the comparison operation can be used to generate data summmaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.
C.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With