I was playing around with different implementations of the Euclidean distance metric and noticed that I get different results for SciPy, pure Python, and Java.
Here's how I compute the distance using SciPy (option 1):
distance = scipy.spatial.distance.euclidean(sample, training_vector)
Here's an implementation in pure Python that I found in a forum (option 2):
distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(training_vector, sample)]))
And lastly, here's my implementation in Java (option 3):
    public double distance(int[] a, int[] b) {
        assert a.length == b.length;
        double squaredDistance = 0.0;
        for (int i = 0; i < a.length; i++) {
            squaredDistance += Math.pow(a[i] - b[i], 2.0);
        }
        return Math.sqrt(squaredDistance);
    }
Both sample and training_vector are 1-D arrays of length 784, taken from the MNIST dataset. I tried all three methods with the same sample and training_vector. The problem is that the three methods give three significantly different distances (around 1936 for option 1, 1914 for option 2, and 1382 for option 3). Interestingly, when I use the same argument order for sample and training_vector in options 1 and 2 (i.e. flip the arguments to option 1 around), I get the same result for these two options. But distance metrics are supposed to be symmetric, right...?
What's also interesting: I'm using these metrics for a k-NN classifier for the MNIST dataset. My Java implementation yields an accuracy of around 94% for 100 test samples and 2700 training samples. However, the Python implementation using option 1 only yields an accuracy of about 75%...
Do you have any ideas as to why I'm getting these different results? If you are interested, I can post a CSV for two arrays online, and post a link here.
I'm using Java 8, Python 2.7, and SciPy 1.0.0.
Edit: Changed option 2 to
distance = math.sqrt(sum([(float(a) - float(b)) ** 2 for a, b in zip(training_vector, sample)]))
This had the following effects:
So, this only leaves the following problem: why is the result different (i.e. wrong?) when using SciPy?
Okay, I found the solution: I had imported both the training and test datasets using pandas with dtype=np.uint8. Consequently, sample and training_vector were both NumPy arrays of type uint8. I changed the data type to np.float32, and now all three options give the same result. I also tried np.uint32, and it works as well.
I'm not quite sure why, but apparently SciPy doesn't give the "expected" result when working with uint8. Maybe there was some internal overflow in SciPy? Not quite sure, but at least it works now. Thanks to everyone who helped!
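For what it's worth, here is a minimal sketch of what such an overflow could look like, assuming the cause is unsigned 8-bit wraparound during the subtraction. It uses plain NumPy only and does not reproduce SciPy's internals; the small arrays and the helper name euclidean_safe are made up for illustration.

```python
import numpy as np

# Hypothetical 3-element stand-ins for two MNIST vectors stored as uint8.
a = np.array([10, 200, 0], dtype=np.uint8)
b = np.array([20, 100, 255], dtype=np.uint8)

# uint8 arithmetic wraps around modulo 256, so negative differences
# turn into large positive values, and the result depends on the order.
diff_ab = a - b  # [246, 100, 1]
diff_ba = b - a  # [10, 156, 255]

# Distances computed from the wrapped differences are not symmetric.
naive_ab = float(np.sqrt(np.sum(diff_ab.astype(np.float64) ** 2)))
naive_ba = float(np.sqrt(np.sum(diff_ba.astype(np.float64) ** 2)))


def euclidean_safe(u, v):
    """Euclidean distance after casting both inputs to float64 first."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.sqrt(np.sum((u - v) ** 2)))


safe_ab = euclidean_safe(a, b)
safe_ba = euclidean_safe(b, a)
```

In this toy case naive_ab and naive_ba differ, while safe_ab and safe_ba agree. That would be consistent with the argument-order effect described above, although it is only a guess at what happened in my setup.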