Let's say I have a dataset that looks like this:
{A:1, B:3, C:6, D:6}
I also have a list of other sets to compare my specific set against:
{A:1, B:3, C:6, D:6},
{A:2, B:3, C:6, D:6},
{A:99, B:3, C:6, D:6},
{A:5, B:1, C:6, D:9},
{A:4, B:2, C:2, D:6}
My entries could be visualized as a table (with four columns: A, B, C, and D).
How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away.
I am thinking of calculating a simple delta, for example: Abs(a1 - a2) + Abs(b1 - b2) + etc
and perhaps get a correlation value for the entries with the best deltas.
Is this a valid way? And what is the name of this problem?
To count the common elements of two arrays 'ARR1' and 'ARR2', declare 'INTERSECTION_SIZE = 0' to store the number of common elements. Sort 'ARR2'; since 'ARR2' is sorted, we can use binary search. For each element in 'ARR1', binary-search 'ARR2' and increment the counter if the element is present.
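The steps above can be sketched in Python using the standard library's `bisect` module (the function name is just illustrative):

```python
import bisect

def intersection_size(arr1, arr2):
    """Count elements of arr1 that also appear in arr2, via binary search."""
    arr2_sorted = sorted(arr2)  # binary search requires a sorted array
    count = 0
    for x in arr1:
        i = bisect.bisect_left(arr2_sorted, x)
        if i < len(arr2_sorted) and arr2_sorted[i] == x:
            count += 1
    return count
```

For the example rows above, `intersection_size([1, 3, 6, 6], [2, 3, 6, 6])` counts three matching elements (3, 6, 6).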
The Sørensen–Dice coefficient is a statistical measure of the similarity between two sets of data. It is defined as twice the size of the intersection of P and Q, divided by the sum of the sizes of P and Q. Like the Jaccard index, its values range from zero (no overlap) to one (identical sets).
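A minimal sketch of the coefficient as defined above (the handling of two empty sets is a convention, not part of the definition):

```python
def dice_coefficient(p, q):
    """Sørensen–Dice coefficient: 2·|P ∩ Q| / (|P| + |Q|)."""
    p, q = set(p), set(q)
    if not p and not q:
        return 1.0  # convention: two empty sets are considered identical
    return 2 * len(p & q) / (len(p) + len(q))
```

For example, `dice_coefficient({1, 2, 3}, {2, 3, 4})` gives 2·2/6 ≈ 0.667.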
In a data-mining sense, a similarity measure is a distance over dimensions describing object features: if the distance between two data points is small, the degree of similarity between the objects is high, and vice versa. Similarity is subjective and depends heavily on the context and application.
Definitions: the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarity is higher for pairs of objects that are more alike. Similarities are usually non-negative and often lie between 0 (no similarity) and 1 (complete similarity).
"Distance" or "similarity" could refer to this type of problem.
Simply calculating the sum of absolute differences, as you've done, should work fairly well. This is called the Manhattan distance. In mathematical terms, it is: ∑_{x ∈ {a,b,c,d}} |x₁ − x₂|.
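Using the records from the question, the Manhattan distance can be sketched like this (representing each row as a dict):

```python
def manhattan_distance(p, q):
    """Sum of absolute per-field differences between two records.

    p and q are dicts with the same keys, e.g. {'A': 1, 'B': 3, ...}.
    """
    return sum(abs(p[k] - q[k]) for k in p)

target = {'A': 1, 'B': 3, 'C': 6, 'D': 6}
candidates = [
    {'A': 1,  'B': 3, 'C': 6, 'D': 6},
    {'A': 2,  'B': 3, 'C': 6, 'D': 6},
    {'A': 99, 'B': 3, 'C': 6, 'D': 6},
    {'A': 5,  'B': 1, 'C': 6, 'D': 9},
    {'A': 4,  'B': 2, 'C': 2, 'D': 6},
]
# The most similar set is the one with the smallest distance.
best = min(candidates, key=lambda c: manhattan_distance(target, c))
```

As expected, row 1 wins with a distance of 0, row 2 scores 1, and row 3 scores 98.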
Although the best measure really depends on what behaviour you want.
A ratio could potentially be a better idea. Consider something like 1000000, 5, 5, 5 vs 999995, 5, 5, 5 and 1000000, 0, 5, 5. According to the above formula, the first would be equally similar to both the second and the third.
If this is not desired (as 999995 can be considered pretty close to 1000000, while 0 can be thought of as quite far from 5), you should divide by the maximum of the two when calculating each distance: ∑_{x ∈ {a,b,c,d}} |x₁ − x₂| / max(x₁, x₂). This will put every term between 0 and 1, which is the percentage difference between the values.
This means that, for our above example, we'd consider 1000000, 5, 5, 5 and 999995, 5, 5, 5 to be very similar (since the sum will be |1000000 − 999995|/1000000 + 0 + 0 + 0 = 0.000005), while 1000000, 5, 5, 5 and 1000000, 0, 5, 5 will be considered much more different (since the sum will be 0 + |5 − 0|/5 + 0 + 0 = 1).
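The normalized formula above can be sketched as follows; note this assumes non-negative values with no field where both records are zero (otherwise `max` is zero and the division fails):

```python
def ratio_distance(p, q):
    """Sum of |p_k - q_k| / max(p_k, q_k) over the shared keys.

    Assumes non-negative values and that at least one of p[k], q[k]
    is non-zero for every key k.
    """
    return sum(abs(p[k] - q[k]) / max(p[k], q[k]) for k in p)

a = {'A': 1000000, 'B': 5, 'C': 5, 'D': 5}
b = {'A': 999995,  'B': 5, 'C': 5, 'D': 5}
c = {'A': 1000000, 'B': 0, 'C': 5, 'D': 5}
```

Here `ratio_distance(a, b)` is 0.000005 while `ratio_distance(a, c)` is 1.0, matching the worked example.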
If negative values are possible, the formula would need to be updated appropriately. You'd need to decide how you want to handle that based on the problem you're trying to solve: should 10 to 0 be more or less different than (or equivalent to) 5 to −5?
Consider something like A=1, B=2, C=3, D=4 and A=4, B=1, C=2, D=3. While every individual element has changed, the set still consists of 1, 2, 3, 4 and each element is simply shifted by one position (apart from 4). For some problems this isn't going to matter at all, and the above wouldn't be all that different from going from A=1, B=11, C=21, D=31 to A=2, B=12, C=22, D=32. For other problems it could be quite relevant though.
For a sequence like a string or array, the idea of inserting, deleting or shifting elements could make sense. If so, you would want to look at edit distance, a common one of which would be Levenshtein distance. You might also want to think about modifying this to consider how much individual values differ by (but this would not be trivial).
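A standard dynamic-programming sketch of Levenshtein distance, which works on any sequence (strings, lists of numbers, etc.):

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (free if equal)
        prev = curr
    return prev[-1]
```

For example, turning the sequence 1, 2, 3, 4 into 4, 1, 2, 3 costs only two edits (insert a 4 at the front, delete the 4 at the end), even though every position differs.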
For something like a set, elements are interchangeable, but there wouldn't really be a strict order on the elements ({1, 2, 3} is the same as {3, 1, 2}). If this is the case, the simplest might be to sort the values and just use edit distance. You may also be able to loop through both at the same time in some way, which would allow you to more easily take the differences between values into account.
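One simple version of that looping idea, combining the sorting suggestion with the Manhattan distance from earlier (assuming both multisets have the same length):

```python
def sorted_manhattan(a, b):
    """Order-insensitive comparison: sort both value lists, then sum
    the absolute differences of the paired-up values.

    Assumes a and b contain the same number of values.
    """
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))
```

With this measure, [1, 2, 3, 4] and [4, 1, 2, 3] are a perfect match (distance 0), since only the order differs.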