
How to use both binary and continuous features in the k-Nearest-Neighbor algorithm?

My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:

For example, if I represent symmetric vs. asymmetric as 0 and 1, and some less important ratio as a value ranging from 0 to 100, then changing from symmetric to asymmetric has a tiny effect on the distance compared to changing the ratio by 25.
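To make the imbalance concrete, here is a minimal sketch using plain Euclidean distance. The point values and the helper name `euclidean` are made up for illustration only; each point is assumed to be a (symmetry flag, ratio) pair.

```python
import math

def euclidean(a, b):
    """Plain (unscaled) Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical points: (symmetry flag in {0, 1}, ratio in [0, 100])
base = (0, 50)
flip_symmetry = (1, 50)   # only the binary feature changes
shift_ratio = (0, 75)     # only the ratio changes, by 25

d_flip = euclidean(base, flip_symmetry)
d_ratio = euclidean(base, shift_ratio)

print(d_flip, d_ratio)  # 1.0 25.0
```

Flipping the binary feature moves the point by 1, while shifting the ratio by 25 moves it by 25, so the ratio dominates the neighbor search.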

I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?

Asked by John Hall, Feb 04 '23 00:02


1 Answer

You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.

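It simply divides every feature (continuous or discrete) by its standard deviation, so that all features contribute on a comparable scale. This is more robust than, say, scaling by the range (max - min), as suggested by another poster, because the range depends entirely on the two most extreme values.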
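As a rough sketch of what this looks like in practice, the snippet below computes a normalized Euclidean distance with only the Python standard library. The sample rows, the helper names `feature_stds` and `normalized_euclidean`, the use of the population standard deviation, and the choice to skip zero-variance features are all assumptions made for illustration, not part of the linked description.

```python
import math
import statistics

def feature_stds(rows):
    """Per-feature standard deviation, estimated from the training rows."""
    columns = zip(*rows)
    # Assumption: population standard deviation; the sample version also works.
    return [statistics.pstdev(col) for col in columns]

def normalized_euclidean(a, b, stds):
    """Euclidean distance with each feature difference divided by its std."""
    # Assumption: features with zero variance carry no information and are skipped.
    return math.sqrt(
        sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds) if s > 0)
    )

# Hypothetical training data: (symmetry flag, ratio)
rows = [(0, 10), (1, 35), (0, 60), (1, 85)]
stds = feature_stds(rows)

base = (0, 50)
d_flip = normalized_euclidean(base, (1, 50), stds)   # flip the binary feature
d_ratio = normalized_euclidean(base, (0, 75), stds)  # shift the ratio by 25

print(d_flip > d_ratio)  # True
```

With this sample data, flipping the symmetry flag now counts for more than shifting the ratio by 25, because each difference is measured in units of that feature's own spread. If the symmetry feature should matter even more, an extra per-feature weight can still be multiplied in after scaling.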

Answered by NPE, Apr 27 '23 01:04