KNN classification with categorical data

Tags: ordinal, classification, matlab, octave, knn

I'm working on a project involving k-nearest neighbour regression. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). The numerical fields are, e.g., salary and age. There are also some binary fields (e.g. male/female).

How do I go about incorporating categorical values into the KNN analysis?

As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers - is this a feasible method? Advice would be much appreciated.

asked Nov 29 '12 12:11

Graham

2 Answers

You need to find a distance function that works for your data. Binary indicator variables solve this problem implicitly, and they have the benefit of letting you keep your (probably matrix-based) implementation with this kind of data. A much simpler way, and one appropriate for most distance-based methods, is to just use a modified distance function.

There are infinitely many ways to combine distances, so you need to experiment to see which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied, though it may make sense to move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
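A minimal sketch of such a combined distance, written in plain Python for illustration rather than MATLAB/Octave. It assumes range-normalized squared differences for the numeric fields and a simple mismatch count for the categorical ones. The field names, the `cat_weight` scaling factor, the sample rows and the target values are all hypothetical; they are not from the question.

```python
import math

def mixed_distance(a, b, numeric, categorical, spans, cat_weight=1.0):
    """Distance between two records (dicts) with mixed field types."""
    # Numeric part: squared differences, each divided by the field's range
    # so that salary does not swamp age.
    num = sum(((a[f] - b[f]) / spans[f]) ** 2 for f in numeric)
    # Categorical part: count the fields whose values differ.
    mismatches = sum(1 for f in categorical if a[f] != b[f])
    # cat_weight scales the categorical part relative to the numeric part.
    return math.sqrt(num + cat_weight * mismatches)

def knn_regress(query, rows, targets, k, numeric, categorical, cat_weight=1.0):
    """Predict the mean target of the k rows nearest to the query."""
    # Field ranges come from the training rows; fall back to 1.0 for a
    # constant field to avoid dividing by zero.
    spans = {f: (max(r[f] for r in rows) - min(r[f] for r in rows)) or 1.0
             for f in numeric}
    order = sorted(
        range(len(rows)),
        key=lambda i: mixed_distance(query, rows[i], numeric, categorical,
                                     spans, cat_weight),
    )
    nearest = order[:k]
    return sum(targets[i] for i in nearest) / len(nearest)

# Hypothetical example data.
rows = [
    {"age": 25, "salary": 30000, "bank": "A", "account": "savings", "sex": "male"},
    {"age": 45, "salary": 80000, "bank": "B", "account": "cheque", "sex": "female"},
    {"age": 30, "salary": 35000, "bank": "A", "account": "savings", "sex": "female"},
]
targets = [10.0, 50.0, 20.0]
query = {"age": 28, "salary": 32000, "bank": "A", "account": "savings", "sex": "male"}

prediction = knn_regress(query, rows, targets, k=2,
                         numeric=["age", "salary"],
                         categorical=["bank", "account", "sex"])
print(prediction)
```

Changing `cat_weight` is one way to experiment with how much a categorical mismatch should count relative to a full-range numeric difference.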

In most real application domains of distance-based algorithms, this is the most difficult part: optimizing your domain-specific distance function. You can see this as part of preprocessing: defining similarity.

There is much more than just Euclidean distance. Various set-theoretic measures may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine similarity might be an option, too.
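As a rough illustration of these set-based measures, the sketch below treats each record as a set of (field, value) pairs, which is an assumption made here for the example. On such sets (equivalently, on binary indicator vectors) the Tanimoto coefficient coincides with Jaccard similarity. A distance can be obtained as one minus the similarity. The records are hypothetical.

```python
import math

def jaccard(a, b):
    """|A & B| / |A | B|; equals the Tanimoto coefficient on binary data."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def dice(a, b):
    """2 |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(a, b):
    """Cosine similarity of the binary membership vectors of A and B."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical records that agree on two of three categorical fields.
x = {("bank", "A"), ("account", "savings"), ("sex", "male")}
y = {("bank", "A"), ("account", "cheque"), ("sex", "male")}

print(jaccard(x, y), dice(x, y), cosine(x, y))
print(1 - jaccard(x, y))  # Jaccard distance
```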

There are whole conferences dedicated to similarity search. Nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012

answered Sep 25 '22 03:09

Has QUIT--Anony-Mousse

The most straightforward way to convert categorical data into numeric form is to use indicator vectors. See the reference I posted in my previous comment.
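A small sketch of indicator (one-hot) encoding, in plain Python for illustration; the helper names and the bank labels are hypothetical. Each category gets its own 0/1 column, so any two different categories end up at the same distance from each other, unlike integer keys, where bank 1 would appear closer to bank 2 than to bank 3.

```python
def fit_categories(values):
    """Return the distinct categories in a fixed (sorted) order."""
    return sorted(set(values))

def indicator_vector(value, categories):
    """One 0/1 entry per category; exactly one entry is 1 for a known value."""
    return [1 if value == c else 0 for c in categories]

def squared_euclidean(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

banks = ["bank 1", "bank 2", "bank 3", "bank 1"]
categories = fit_categories(banks)
encoded = [indicator_vector(b, categories) for b in banks]

for name, vec in zip(banks, encoded):
    print(name, vec)

# Different banks are always at squared distance 2, identical banks at 0.
print(squared_euclidean(encoded[0], encoded[1]))
print(squared_euclidean(encoded[0], encoded[2]))
print(squared_euclidean(encoded[0], encoded[3]))
```

The resulting vectors can be placed next to the (normalized) numeric columns, so an existing matrix-based KNN implementation can be used unchanged.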

answered Sep 25 '22 03:09

Shai
