 

Euclidean distance: different results between SciPy, pure Python, and Java

I was playing around with different implementations of the Euclidean distance metric and noticed that I get different results from SciPy, pure Python, and Java.

Here's how I compute the distance using SciPy (option 1):

distance = scipy.spatial.distance.euclidean(sample, training_vector)

Here's a pure-Python implementation I found in a forum (option 2):

distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(training_vector, sample)]))

And lastly, here's my implementation in Java (option 3):

public double distance(int[] a, int[] b) {
    assert a.length == b.length;
    // Sum of squared component-wise differences, then the square root.
    double squaredDistance = 0.0;
    for (int i = 0; i < a.length; i++) {
        squaredDistance += Math.pow(a[i] - b[i], 2.0);
    }
    return Math.sqrt(squaredDistance);
}

Both sample and training_vector are 1-D arrays of length 784, taken from the MNIST dataset. I tried all three methods with the same sample and training_vector. The problem is that the three methods give three significantly different distances (around 1936 for option 1, 1914 for option 2, and 1382 for option 3). Interestingly, when I use the same argument order for sample and training_vector in options 1 and 2 (i.e. swap the arguments of option 1), I get the same result for these two options. But distance metrics are supposed to be symmetric, right...?

What's also interesting: I'm using these metrics for a k-NN classifier for the MNIST dataset. My Java implementation yields an accuracy of around 94% for 100 test samples and 2700 training samples. However, the Python implementation using option 1 only yields an accuracy of about 75%...

Do you have any ideas as to why I'm getting these different results? If you are interested, I can post a CSV for two arrays online, and post a link here.

I'm using Java 8, Python 2.7, and SciPy 1.0.0.

Edit: Changed option 2 to

distance = math.sqrt(sum([(float(a) - float(b)) ** 2 for a, b in zip(training_vector, sample)]))

This had the following effects:

  • It got rid of a ubyte overflow warning (I must have missed this warning before...).
  • Changing the argument order for options 1 and 2 no longer makes a difference.
  • The results for options 2 (pure Python) and 3 (Java) are now equal.

So, this only leaves the following problem: why is the result different (i.e. wrong?) when using SciPy?

asked Feb 28 '18 by Silas Berger


1 Answer

Okay, I found the solution: I had imported both the training and the test dataset using pandas with dtype=np.uint8. Consequently, sample and training_vector were both numpy arrays of type uint8. I changed the data type to np.float32 and now all three options give the same result. I also tried np.uint32 and it works as well.

I'm not quite sure why, but apparently SciPy doesn't give the "expected" result when working with uint8; maybe there's some internal overflow in SciPy. In any case, it works now. Thanks to everyone who helped!

answered Nov 15 '22 by Silas Berger