The code below (to compute cosine similarity), when run repeatedly on my computer, will output 1.0, 0.9999999999999998, or 1.0000000000000002. When I take out the normalize function, it will only return 1.0. I thought floating point operations were supposed to be deterministic. What would be causing this in my program if the same operations are being applied on the same data on the same computer each time? Is it maybe something to do with where on the stack the normalize function is being called? How can I prevent this?
#!/usr/bin/env python3
import math

def normalize(vector):
    sum = 0
    for key in vector.keys():
        sum += vector[key]**2
    sum = math.sqrt(sum)
    for key in vector.keys():
        vector[key] = vector[key]/sum
    return vector

dict1 = normalize({"a": 3, "b": 4, "c": 42})
dict2 = dict1

n_grams = list(list(dict1.keys()) + list(dict2.keys()))

numerator = 0
denom1 = 0
denom2 = 0
for n_gram in n_grams:
    numerator += dict1[n_gram] * dict2[n_gram]
    denom1 += dict1[n_gram]**2
    denom2 += dict2[n_gram]**2

print(numerator/(math.sqrt(denom1)*math.sqrt(denom2)))
Floating-point math is deterministic, but the ordering of dictionary keys is not. On the Python versions where this behaviour appears (before dicts preserved insertion order), hash randomization means that the iteration order of .keys() can differ from one run of the interpreter to the next. The order of the arithmetic inside your loops can therefore differ between runs as well, and while any single floating-point operation is deterministic, the result of a series of operations depends very much on their order, because floating-point addition is not associative.
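To see why ordering matters, here is a small illustration (the particular values are just examples chosen to expose the rounding): summing the same floats in two different orders gives two slightly different results.

# Floating-point addition is not associative, so grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6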
You can enforce a consistent order by sorting the keys before iterating over them, as in the sketch below.
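A minimal sketch of that fix applied to the code from the question; iterating with sorted() is the only change, everything else matches the original:

#!/usr/bin/env python3
import math

def normalize(vector):
    sum = 0
    # Iterate in sorted key order so the summation order is identical on every run.
    for key in sorted(vector.keys()):
        sum += vector[key]**2
    sum = math.sqrt(sum)
    for key in sorted(vector.keys()):
        vector[key] = vector[key]/sum
    return vector

dict1 = normalize({"a": 3, "b": 4, "c": 42})
dict2 = dict1

n_grams = list(list(dict1.keys()) + list(dict2.keys()))

numerator = 0
denom1 = 0
denom2 = 0
# Sorting here fixes the accumulation order of the dot product and the norms.
for n_gram in sorted(n_grams):
    numerator += dict1[n_gram] * dict2[n_gram]
    denom1 += dict1[n_gram]**2
    denom2 += dict2[n_gram]**2

print(numerator/(math.sqrt(denom1)*math.sqrt(denom2)))

With the iteration order pinned down, the same operations are applied in the same order on every run, so the printed value no longer varies.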