
Measuring the similarity between two irregular plots

I have two irregular lines, each given as a list of [x, y] coordinates with peaks and troughs. The two lists may differ slightly in length. I want a similarity measure that checks whether peaks and troughs of similar height or depth occur at matching intervals. I want to do this in Python. Is there a built-in function for this?


pradeepln4 asked Sep 16 '15 05:09




1 Answer

I don't know of any built-in functions in Python that do this.

I can give you a list of functions in the Python ecosystem that you can use. It is by no means complete, and there are probably quite a few methods that I am not aware of.

If the data is ordered, but you don't know which data point is the first and which data point is last:

  1. Use the directed Hausdorff distance
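As a rough illustration of what the directed Hausdorff distance measures, here is a minimal NumPy sketch. The function names `directed_hausdorff_np` and `hausdorff_np` and the small arrays `a` and `b` are made up for this example; this is not SciPy's implementation, and it builds the full pairwise distance matrix, so it only suits small inputs.

```python
import numpy as np

def directed_hausdorff_np(P, Q):
    """Largest distance from any point of P to its nearest point in Q."""
    # Pairwise Euclidean distances, shape (len(P), len(Q))
    dist = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    # Nearest neighbour in Q for every point of P, then the worst case
    return dist.min(axis=1).max()

def hausdorff_np(P, Q):
    """Symmetric variant: the larger of the two directed distances."""
    return max(directed_hausdorff_np(P, Q), directed_hausdorff_np(Q, P))

# Hypothetical toy data: three points on y = 0, two points on y = 1
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0]])

print(directed_hausdorff_np(a, b))  # a -> b
print(directed_hausdorff_np(b, a))  # b -> a, generally a different value
print(hausdorff_np(a[::-1], b))     # reordering the points changes nothing
```

The directed distance is not symmetric, and it only looks at nearest neighbours, which is why the order of the points does not matter.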

If the data is ordered, and you know the first and last points are correct:

  1. Discrete Fréchet distance *
  2. Dynamic Time Warping (DTW) *
  3. Partial Curve Mapping (PCM) **
  4. A Curve-Length distance metric (uses arc length distance from beginning to end) **
  5. Area between two curves **

* General mathematical methods used in a variety of machine learning tasks

** Methods I've used to identify unique material hysteresis responses
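To show why the ordered methods need the first and last points to be correct, here is a minimal sketch of the discrete Fréchet distance using the usual dynamic-programming recurrence. The name `discrete_frechet` and the toy arrays are assumptions for this example; it is not the code used inside similaritymeasures.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two ordered point sequences."""
    n, m = len(P), len(Q)
    # d[i, j] = Euclidean distance between P[i] and Q[j]
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    # ca[i, j] = coupling distance for the prefixes P[:i+1] and Q[:j+1]
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_prev = min(ca[i - 1, j], ca[i, j - 1], ca[i - 1, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return ca[-1, -1]

# Hypothetical toy data: two parallel three-point lines, one unit apart
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])

print(discrete_frechet(a, b))        # both curves traversed left to right
print(discrete_frechet(a, b[::-1]))  # second curve traversed right to left
```

Reversing one curve increases the result here because both curves must be walked from their first point to their last, so the start and end points are forced to pair up.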

First, let's assume we have two exact copies of the same random X-Y data. All of these methods will then return zero. You can install similaritymeasures with pip if you do not have it.

import numpy as np
from scipy.spatial.distance import directed_hausdorff
import similaritymeasures
import matplotlib.pyplot as plt

# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T

# Q is an exact copy of P, which we will use for the comparison
Q = P.copy()

dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)

# all methods will return 0.0 when P and Q are the same
print(dh, df, dtw, pcm, cl, area)

The printed output is 0.0, 0.0, 0.0, 0.0, 0.0, 0.0. This is because the curves P and Q are exactly the same!

Now let's assume P and Q are different.

# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T

# Generate random Q
x = np.random.random(100)
y = np.random.random(100)
Q = np.array([x, y]).T

dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)

# P and Q differ here, so every method returns a nonzero value
print(dh, df, dtw, pcm, cl, area)

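The printed output is 0.107, 0.743, 37.69, 21.5, 6.86, 11.8. These values quantify how different P is from Q according to each method, in the order printed: directed Hausdorff, discrete Fréchet, DTW, PCM, curve length, and area.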

You now have many methods to compare the two curves. I would start with DTW, since it has been used in many time series applications whose data resembles what you have uploaded.
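To make the DTW suggestion concrete for curves of unequal length, here is a minimal sketch of the classic DTW recurrence in NumPy. The function `dtw_distance` and the synthetic sine curves are assumptions made for this example; they stand in for the uploaded data and are not the similaritymeasures implementation.

```python
import numpy as np

def dtw_distance(P, Q):
    """Accumulated DTW cost between two ordered point sequences."""
    n, m = len(P), len(Q)
    # Pairwise Euclidean distances between points of P and Q
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    # acc[i, j] = cheapest cost of aligning the first i points of P
    # with the first j points of Q; the border is infinite
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = d[i - 1, j - 1] + step
    return acc[n, m]

# Hypothetical curves: same peaks sampled at 100 and 90 points,
# plus a third curve whose peaks arrive at a different interval
t_long = np.linspace(0.0, 1.0, 100)
t_short = np.linspace(0.0, 1.0, 90)
curve_a = np.column_stack([t_long, np.sin(2 * np.pi * 3 * t_long)])
curve_b = np.column_stack([t_short, np.sin(2 * np.pi * 3 * t_short)])
curve_c = np.column_stack([t_short, np.sin(2 * np.pi * 5 * t_short)])

print(dtw_distance(curve_a, curve_b))  # same peak spacing: small cost
print(dtw_distance(curve_a, curve_c))  # different peak spacing: larger cost
```

The accumulated cost grows with the number of points, so when comparing results across curves of very different lengths it may help to normalize, for example by the number of points.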

We can visualize what P and Q look like with the following code.

plt.figure()
plt.plot(P[:, 0], P[:, 1])
plt.plot(Q[:, 0], Q[:, 1])
plt.show()

Two random paths in the XY space

Charles Jekel answered Oct 04 '22 03:10