
Right way to compute cosine similarity between two arrays?

I am working on a project that extracts features from two input images (handwritten signatures) and compares them using cosine similarity. Of the two input images, one is the original and the other is a duplicate. Say I extract 15 features from the original image and store them in one array (Array_ORG), and store the features of the other image in Array_DUP in the same way. Now I am trying to calculate the cosine similarity between these two arrays. Both arrays are of type double.

Here are the two methods I tried:

1) Manual calculation of cosine similarity:

main(){

for(int i=0;i<15;i++)
    sum_org += (Array_org[i]*Array_org[i]);
for(int i=0;i<15;i++)
    sum_dup += (Array_dup[i]*Array_dup[i]);
double magnitude = sqrt(sum_org +sum_dup );
double cosine_similarity = dot_product(Array_org, Array_dup, sizeof(Array_org)/sizeof(Array_org[0]))/magnitude;
}

double dot_product(double *a, double* b, size_t n){
double sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
            sum += a[i] * b[i];
    }

    return sum;
}

2) Storing the values in a Mat and calling the dot function:

Mat A = Mat(1,15,CV_32FC1,&Array_org);
Mat B = Mat(1,15,CV_32FC1,&Array_dup);
double similarity = cal_theta(A,B);

double cal_theta(Mat A, Mat B){
double ab = A.dot(B);
double aa = A.dot(A);
double bb = B.dot(B);
return -ab / sqrt(aa*bb);
}

I have read that cosine similarity ranges from -1 to 1, with -1 meaning the two vectors point in exactly opposite directions and 1 meaning they point in the same direction. But the first function gives me values in the thousands, and the second gives me values greater than 1.
Please tell me which approach is right, and why. Also, how do I interpret the similarity if the cosine similarity value is greater than 1?

Shruthi Kodi asked May 22 '15 19:05



1 Answer

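The correct definition of cosine similarity is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}$$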

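Your first method does not compute the denominator correctly: it divides by sqrt(sum_org + sum_dup), whereas the definition requires sqrt(sum_org) * sqrt(sum_dup). Hence the values are wrong. In the second method, the arrays hold double values but the Mat is declared as CV_32FC1 (32-bit float), so the data is misread; CV_64FC1 is the type that matches double. The leading minus sign in cal_theta also flips the sign of the result.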

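#include <math.h>  // sqrt

// Cosine similarity of A and B, each holding Vector_Length elements.
// If either vector is all zeros, the denominator is zero and the result
// is not meaningful.
double cosine_similarity(double *A, double *B, unsigned int Vector_Length)
{
    double dot = 0.0, denom_a = 0.0, denom_b = 0.0;
    for (unsigned int i = 0u; i < Vector_Length; ++i) {
        dot += A[i] * B[i];       // numerator: dot product of A and B
        denom_a += A[i] * A[i];   // squared magnitude of A
        denom_b += B[i] * B[i];   // squared magnitude of B
    }
    return dot / (sqrt(denom_a) * sqrt(denom_b));
}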
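
If it helps, here is a minimal C++ sketch of the same formula with a couple of safety checks added. The name cosine_similarity_checked is hypothetical, and returning 0.0 for mismatched lengths or zero-magnitude vectors is only an assumed convention (the ratio is undefined in that case), so adapt it to whatever your project needs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the original answer.
// Computes dot(a, b) / (|a| * |b|) for two equal-length vectors.
// Assumed convention: returns 0.0 if the lengths differ, the vectors are
// empty, or either vector has zero magnitude.
double cosine_similarity_checked(const std::vector<double>& a,
                                 const std::vector<double>& b)
{
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }

    double dot = 0.0, norm_a_sq = 0.0, norm_b_sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot       += a[i] * b[i];  // numerator
        norm_a_sq += a[i] * a[i];  // squared magnitude of a
        norm_b_sq += b[i] * b[i];  // squared magnitude of b
    }

    if (norm_a_sq == 0.0 || norm_b_sq == 0.0) {
        return 0.0;
    }

    double c = dot / (std::sqrt(norm_a_sq) * std::sqrt(norm_b_sq));

    // Floating-point rounding can push the ratio slightly outside [-1, 1],
    // so clamp it back into the valid range.
    if (c > 1.0)  c = 1.0;
    if (c < -1.0) c = -1.0;
    return c;
}
```

With this, two vectors pointing in the same direction (for example {1, 2, 3} and {2, 4, 6}) give a value close to 1, perpendicular vectors give a value close to 0, and opposite vectors give a value close to -1. A result outside that range indicates a bug in the computation, not a meaningful similarity.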
a_pradhan answered Sep 27 '22 15:09
