i have 2 arrays of points (x,y), with those points I can draw 2 curves.
Anyone have ideas how to calculate how those curves are similar?
Popular answers (1) You need to calculate the cumulative distributions, then calculate the maximum distance between those. This is a relatively robust measure of (dis-)similarity between two distributions. The smaller the dissimilarity, the more the curves are "alike".
You can always calculate the area between those two curves. (This is a bit easier if the endpoints match.) The curves are similar if the area is small, not so similar if the area is not small.
Note that I did not define 'small'. That was intentional. Then again, you didn't define 'similar'.
Edit
Sometimes area isn't the best metric. For example consider the function f(x)=0 and f(x)=1e6*sin(x). If the range of x is some integral multiple of 2*pi, the area between these curves is zero. A function that oscillates between plus and minus one million is not a good approximation of f(x)=0.
A better metric is needed. Here are a couple. Note: I am assuming here that the x values are identical in the two sets; the only things that differ are the y values.
Sum of squares. For each x value, compute delta_yi = y1,i - y2,i and accumulate delta_yi2. This metric is the basis for a least square optimization, where the goal is to minimize the sum of the squares of the errors. This is a widely used approach because oftentimes it is fairly easy to implement.
Maximum deviation. Find the abs_delta_yi = |y1,i - y2,i| that maximizes the |y1,i - y2,i| for all x values. This metric is the basis for a lot of the implementations of the functions in the math library, where the goal is to minimize the maximum error. These math library implementations are approximations of the true function. As a consumer of such an approximation, I typically care more about the worst thing that the approximation is going to do to my application than I care about how that approximation is going to behave on average.
You might want to consider using Dynamic Time Warping (DTW) or Frechet distance.
Dynamic Time Warping sums the difference throughout the entire curve. It can handle two arrays of different sizes. Here is a snippet from Wikipedia on how the code might look. This solution uses a two-dimensional array. The cost would be the distance between two points. The final value of the array DTW[n, m] contains the cumulative distance.
int DTWDistance(s: array [1..n], t: array [1..m]) {
DTW := array [0..n, 0..m]
for i := 1 to n
DTW[i, 0] := infinity
for i := 1 to m
DTW[0, i] := infinity
DTW[0, 0] := 0
for i := 1 to n
for j := 1 to m
cost:= d(s[i], t[j])
DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion
DTW[i , j-1], // deletion
DTW[i-1, j-1]) // match
return DTW[n, m]
}
DTW is similar to Jacopson's answer.
Frechet distance calculates the farthest that the curves separate. This means that all other points on the curve are closer together than this distance. This approach is typically represented with a dog and owner as shown here: Frechet Distance Example.
Depending on your arrays, you can compare the distance of the points and use the maximum.
I assume a Curve is an array of 2D points over the real numbers, the size of the array is N
, so I call p[i]
the i
-th point of the curve; i
goes from 0
to N-1
.
I also assume that the two curves have the same size and that it is meaningful to "compare" the i
-th point of the first curve with the i
-th point of the second curve.
I call Delta
, a real number, the result of the comparison of the two curves.
Delta
can be computed as follow:
Delta = 0;
for( i = 0; i < N; i++ ) {
Delta = Delta + distance(p[i],q[i]);
}
where p
are points from the first curve and q
are points from the second curve.
Now you have to choose a suitable distance
function depending on your problem: the function has two points as arguments and returns a real number.
For example distance
can be the usual distance of two point on the plane (Pythagorean theorem and http://en.wikipedia.org/wiki/Euclidean_distance).
An example of the method in C++:
#include <numeric>
#include <vector>
#include <cmath>
#include <iostream>
#include <functional>
#include <stdexcept>
typedef double Real_t;
class Point
{
public:
Point(){}
Point(std::initializer_list<Real_t> args):x(args.begin()[0]),y(args.begin()[1]){}
Point( const Real_t& xx, const Real_t& yy ):x(xx),y(yy){}
Real_t x,y;
};
typedef std::vector< Point > Curve;
Real_t point_distance( const Point& a, const Point& b )
{
return hypot(a.x-b.x,a.y-b.y);
}
Real_t curve_distance( const Curve& c1, const Curve& c2 )
{
if ( c1.size() != c2.size() ) throw std::invalid_argument("size mismatch");
return std::inner_product( c1.begin(), c1.end(), c2.begin(), Real_t(0), std::plus< Real_t >(), point_distance );
}
int main(int,char**)
{
Curve c1{{0,0},
{1,1},
{2,4},
{3,9}};
Curve c2{{0.1,-0.1},
{1.1,0.9},
{2.1,3.9},
{3.1,8.9}};
std::cout << curve_distance(c1,c2) << "\n";
return 0;
}
If your two curves have different size then you have to think how to extend the previous method, for example you can reduce the size of the longest curve by means of a suitable algorithm (for example the Ramer–Douglas–Peucker algorithm can be a starting point) in order to match it to the size of the shortest curve.
I have just described a very simple method, you can also take different approaches; for example you can fit two curves to the two set of points and then work with the two curves expressed as mathematical function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With