My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:
If I represent symmetric vs. asymmetric as 0 and 1, alongside some less important ratio ranging from 0 to 100, then changing from symmetric to asymmetric has a tiny impact on the distance compared to changing the ratio by 25.
I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
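To make the imbalance concrete, here is a minimal sketch in Python. The two-component layout (symmetry flag, ratio) and the specific values are made up for illustration only.

```python
import math

def euclidean(a, b):
    """Plain (unscaled) Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: (symmetry flag in {0, 1}, ratio in [0, 100]).
base = (0, 50)
flip_symmetry = (1, 50)   # only the binary feature changes
shift_ratio = (0, 75)     # only the ratio changes, by 25

d_flip = euclidean(base, flip_symmetry)
d_shift = euclidean(base, shift_ratio)
print(d_flip, d_shift)
```

Flipping the symmetry flag moves the point by 1, while shifting the ratio by 25 moves it by 25, so the ratio dominates the distance.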
You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.
It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
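As a rough sketch of that idea, each squared difference is divided by the variance of its feature before summing, i.e. d(a, b) = sqrt(sum_i ((a_i - b_i) / s_i)^2). The data set, the function names, and the use of the sample standard deviation below are all assumptions for illustration, not something taken from the question.

```python
import math
import statistics

def feature_stdevs(rows):
    """Sample standard deviation of each feature (column) in the data set."""
    return [statistics.stdev(col) for col in zip(*rows)]

def normalized_euclidean(a, b, stdevs):
    """Euclidean distance with every feature scaled by its standard deviation.

    Assumes no feature is constant; a zero standard deviation would
    cause a division by zero here.
    """
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stdevs)))

# Hypothetical data: (symmetry flag in {0, 1}, ratio in [0, 100]).
data = [(0, 10), (1, 35), (0, 50), (1, 75), (0, 90), (1, 40)]
stdevs = feature_stdevs(data)

d_flip = normalized_euclidean((0, 50), (1, 50), stdevs)   # symmetry flipped
d_shift = normalized_euclidean((0, 50), (0, 75), stdevs)  # ratio shifted by 25
print(d_flip, d_shift)
```

With this made-up data, flipping the symmetry flag now contributes more to the distance than shifting the ratio by 25, because both features are measured in units of their own spread. If the symmetry feature should count for even more, an extra per-feature weight could be multiplied in on top of the scaling.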