Let's say I have a dataset that looks like this:
{A:1, B:3, C:6, D:6}
I also have a list of other sets to compare my specific set against:
{A:1, B:3, C:6, D:6},
{A:2, B:3, C:6, D:6},
{A:99, B:3, C:6, D:6},
{A:5, B:1, C:6, D:9},
{A:4, B:2, C:2, D:6}
My entries could be visualized as a table (with four columns: A, B, C, and D).
How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away.
I am thinking of calculating a simple delta, for example: Abs(a1 - a2) + Abs(b1 - b2) + etc
and perhaps get a correlation value for the entries with the best deltas.
Is this a valid way? And what is the name of this problem?
To count the common elements of two arrays 'ARR1' and 'ARR2', declare 'INTERSECTION_SIZE = 0' to store the number of common elements. Sort 'ARR2'; since 'ARR2' is sorted, we can use binary search. For each element in 'ARR1', binary-search 'ARR2' and increment the counter if the element is present.
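The steps above can be sketched in Python using the standard library's `bisect` module (the function name is just illustrative):

```python
import bisect

def intersection_size(arr1, arr2):
    """Count elements of arr1 that also appear in arr2, via binary search."""
    arr2_sorted = sorted(arr2)  # binary search requires a sorted array
    count = 0
    for x in arr1:
        i = bisect.bisect_left(arr2_sorted, x)
        if i < len(arr2_sorted) and arr2_sorted[i] == x:
            count += 1
    return count
```

For the example rows above, `intersection_size([1, 3, 6, 6], [2, 3, 6, 6])` counts three matching elements (3, 6, 6).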
The Sørensen–Dice coefficient is a statistical measure of the similarity between two sets of data. It is defined as twice the size of the intersection of P and Q, divided by the sum of the sizes of P and Q. Like the Jaccard index, its values range from zero (no overlap) to one (identical sets).
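A minimal sketch of the coefficient as defined above (the handling of two empty sets is a convention, not part of the definition):

```python
def dice_coefficient(p, q):
    """Sørensen–Dice coefficient: 2·|P ∩ Q| / (|P| + |Q|)."""
    p, q = set(p), set(q)
    if not p and not q:
        return 1.0  # convention: two empty sets are considered identical
    return 2 * len(p & q) / (len(p) + len(q))
```

For example, `dice_coefficient({1, 2, 3}, {2, 3, 4})` gives 2·2/6 ≈ 0.667.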
In a data-mining sense, a similarity measure is a distance over dimensions describing object features: if the distance between two data points is small, the degree of similarity between the objects is high, and vice versa. Similarity is subjective and depends heavily on the context and application.
Definitions: the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarity is higher for pairs of objects that are more alike. Similarities are usually non-negative and often lie between 0 (no similarity) and 1 (complete similarity).
"Distance" or "similarity" could refer to this type of problem.
Simply calculating the sum of absolute differences, as you've done, should work fairly well. This is called the Manhattan distance. In mathematical terms, it is: ∑_{x ∈ {a,b,c,d}} |x₁ − x₂|.
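Using the records from the question, the Manhattan distance can be sketched like this (representing each row as a dict):

```python
def manhattan_distance(p, q):
    """Sum of absolute per-field differences between two records.

    p and q are dicts with the same keys, e.g. {'A': 1, 'B': 3, ...}.
    """
    return sum(abs(p[k] - q[k]) for k in p)

target = {'A': 1, 'B': 3, 'C': 6, 'D': 6}
candidates = [
    {'A': 1,  'B': 3, 'C': 6, 'D': 6},
    {'A': 2,  'B': 3, 'C': 6, 'D': 6},
    {'A': 99, 'B': 3, 'C': 6, 'D': 6},
    {'A': 5,  'B': 1, 'C': 6, 'D': 9},
    {'A': 4,  'B': 2, 'C': 2, 'D': 6},
]
# The most similar set is the one with the smallest distance.
best = min(candidates, key=lambda c: manhattan_distance(target, c))
```

As expected, row 1 wins with a distance of 0, row 2 scores 1, and row 3 scores 98.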
Although the best measure really depends on what behaviour you want.
A ratio could potentially be a better idea. Consider something like 1000000, 5, 5, 5 vs 999995, 5, 5, 5 and 1000000, 0, 5, 5. According to the above formula, the first would be equally similar to both the second and the third.
If this is not desired (as 999995 can be considered pretty close to 1000000, while 0 can be thought of as quite far from 5), you should divide by the maximum of the two when calculating each distance: ∑_{x ∈ {a,b,c,d}} |x₁ − x₂| / max(x₁, x₂). This will put every term between 0 and 1, which is the percentage difference between the values.
This means that, for our above example, we'd consider 1000000, 5, 5, 5 and 999995, 5, 5, 5 to be very similar (since the sum will be |1000000 − 999995|/1000000 + 0 + 0 + 0 = 0.000005), while 1000000, 5, 5, 5 and 1000000, 0, 5, 5 will be considered much more different (since the sum will be 0 + |5 − 0|/5 + 0 + 0 = 1).
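The normalized formula above can be sketched as follows; note this assumes non-negative values with no field where both records are zero (otherwise `max` is zero and the division fails):

```python
def ratio_distance(p, q):
    """Sum of |p_k - q_k| / max(p_k, q_k) over the shared keys.

    Assumes non-negative values and that at least one of p[k], q[k]
    is non-zero for every key k.
    """
    return sum(abs(p[k] - q[k]) / max(p[k], q[k]) for k in p)

a = {'A': 1000000, 'B': 5, 'C': 5, 'D': 5}
b = {'A': 999995,  'B': 5, 'C': 5, 'D': 5}
c = {'A': 1000000, 'B': 0, 'C': 5, 'D': 5}
```

Here `ratio_distance(a, b)` is 0.000005 while `ratio_distance(a, c)` is 1.0, matching the worked example.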
If negative values are possible, the formula would need to be updated appropriately. You'd need to decide how you want to handle that based on the problem you're trying to solve: should 10 to 0 be more or less different than (or equivalent to) 5 to −5?
Consider something like A=1, B=2, C=3, D=4 and A=4, B=1, C=2, D=3. While every individual element has changed, the set still consists of 1, 2, 3, 4 and each element is simply shifted by one position (apart from 4). For some problems this isn't going to matter at all, and the above wouldn't be all that different from going from A=1, B=11, C=21, D=31 to A=2, B=12, C=22, D=32. For other problems it could be quite relevant though.
For a sequence like a string or array, the idea of inserting, deleting or shifting elements could make sense. If so, you would want to look at edit distance, a common one of which would be Levenshtein distance. You might also want to think about modifying this to consider how much individual values differ by (but this would not be trivial).
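A standard dynamic-programming sketch of Levenshtein distance, which works on any sequence (strings, lists of numbers, etc.):

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (free if equal)
        prev = curr
    return prev[-1]
```

For example, turning the sequence 1, 2, 3, 4 into 4, 1, 2, 3 costs only two edits (insert a 4 at the front, delete the 4 at the end), even though every position differs.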
For something like a set, elements are interchangeable, but there wouldn't really be a strict order on the elements ({1, 2, 3} is the same as {3, 1, 2}). If this is the case, the simplest might be to sort the values and just use edit distance. You may also be able to loop through both at the same time in some way, which would allow you to more easily take the differences between values into account.
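One simple version of that looping idea, combining the sorting suggestion with the Manhattan distance from earlier (assuming both multisets have the same length):

```python
def sorted_manhattan(a, b):
    """Order-insensitive comparison: sort both value lists, then sum
    the absolute differences of the paired-up values.

    Assumes a and b contain the same number of values.
    """
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))
```

With this measure, [1, 2, 3, 4] and [4, 1, 2, 3] are a perfect match (distance 0), since only the order differs.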