I have two irregular lines, each given as a list of [x, y] coordinates, with peaks and troughs. The two lists may differ slightly in length. I want to measure their similarity by checking whether peaks and troughs of similar height or depth occur at matching intervals, and get a similarity score out of it. I want to do this in Python. Is there a built-in function for this?
The Sørensen–Dice coefficient is a statistic used to measure the similarity between two sets of data. It is defined as twice the size of the intersection of P and Q, divided by the sum of the sizes of P and Q. Like the Jaccard index, its values range from zero to one.
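As a rough illustration of that definition, here is a minimal sketch in plain Python. The function name `dice_coefficient` and the choice to return 1.0 for two empty sets are my own assumptions, not part of any library. Note that it treats its inputs as sets, so it ignores the order of the points and only counts exact matches.

```python
def dice_coefficient(p, q):
    """Sørensen–Dice coefficient of two collections, treated as sets."""
    p, q = set(p), set(q)
    if not p and not q:
        # Assumption: two empty sets are considered identical.
        return 1.0
    # Twice the size of the intersection over the sum of both sizes.
    return 2 * len(p & q) / (len(p) + len(q))


# Two small sets that share the elements 3 and 4.
print(dice_coefficient({1, 2, 3, 4}, {3, 4, 5}))
```

Because only exact matches count, this is better suited to categorical or set-like data than to two noisy curves whose coordinates rarely coincide.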
To measure similarity there is also the maximal information coefficient (MIC). It quantifies the information shared between two data sets or curves.
To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value. For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes.
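The shoe-size case can be sketched in a couple of lines. The name `size_similarity` and the `max_difference` scale are hypothetical choices for illustration only; any scale that maps "same size" to 1 and "very different" to 0 would do.

```python
def size_similarity(size_a, size_b, max_difference=10.0):
    """Map the absolute size difference into [0, 1]; 1 means identical sizes.

    max_difference is an assumed scale: differences at or beyond it score 0.
    """
    diff = abs(size_a - size_b)
    return max(0.0, 1.0 - diff / max_difference)


print(size_similarity(9, 9))   # identical sizes
print(size_similarity(8, 11))  # three sizes apart
```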
I don't know of any built-in functions in Python to do this.
I can give you a list of possible functions in the Python ecosystem you can use. This is in no way a complete list of functions, and there are probably quite a few methods out there that I am not aware of.
If the data is ordered, but you don't know which data point is the first and which data point is last:
If the data is ordered, and you know the first and last points are correct:
* General mathematical method used in a variety of machine learning tasks
** Methods I've used to identify unique material hysteresis responses
First, let's assume we have two exact copies of the same random X, Y data. Every one of these methods will then return zero. You can install the similaritymeasures package with pip if you do not have it.
import numpy as np
from scipy.spatial.distance import directed_hausdorff
import similaritymeasures
import matplotlib.pyplot as plt
# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T
# Generate an exact copy of P, Q, which we will use to compare
Q = P.copy()
dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)
# all methods will return 0.0 when P and Q are the same
print(dh, df, dtw, pcm, cl, area)
The printed output is 0.0, 0.0, 0.0, 0.0, 0.0, 0.0. This is because the curves P and Q are exactly the same!
Now let's assume P and Q are different.
# Generate random experimental data
np.random.seed(121)
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T
# Generate random Q
x = np.random.random(100)
y = np.random.random(100)
Q = np.array([x, y]).T
dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)
# P and Q now differ, so each method returns a nonzero value
print(dh, df, dtw, pcm, cl, area)
The printed output is 0.107, 0.743, 37.69, 21.5, 6.86, 11.8. These values quantify how different P is from Q according to each method.
You now have many methods to compare the two curves. I would start with DTW, since it has been used in many time series applications that look like the data you have uploaded.
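To give a feel for what DTW computes, and why it copes with lists of slightly different length, here is a minimal pure-Python sketch of the classic dynamic-programming formulation with Euclidean point distances. The function name `dtw_distance` and the two toy curves are my own; this is not the similaritymeasures implementation and is not tuned for long curves.

```python
import math


def dtw_distance(p, q):
    """Dynamic time warping distance between two sequences of (x, y) points.

    The sequences may differ in length. Uses Euclidean distance between points.
    """
    n, m = len(p), len(q)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning p[:i] with q[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(p[i - 1], q[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# A peak followed by a trough, and the same shape sampled with one extra point.
curve_a = [(0, 0), (1, 2), (2, 0), (3, -2), (4, 0)]
curve_b = [(0, 0), (1, 2), (1.5, 1), (2, 0), (3, -2), (4, 0)]

print(dtw_distance(curve_a, curve_a))  # identical curves
print(dtw_distance(curve_a, curve_b))  # unequal lengths are fine
```

The warping step lets one point on a curve match several points on the other, which is why a peak that arrives slightly early or late is penalized only mildly. The sketch needs Python 3.8 or later for `math.dist`.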
We can visualize what P and Q look like with the following code.
plt.figure()
plt.plot(P[:, 0], P[:, 1])
plt.plot(Q[:, 0], Q[:, 1])
plt.show()