
KNN classification with categorical data

I'm working on a project involving k-nearest neighbour regression. I have mixed numerical and categorical fields. The categorical fields are nominal, i.e. unordered (e.g. bank name, account type). The numerical fields are, e.g., salary and age. There are also some binary fields (e.g. sex: male or female).

How do I go about incorporating categorical values into the KNN analysis?

As far as I'm aware, one cannot simply map each categorical value to an integer key (e.g. bank 1 = 1; bank 2 = 2, etc.), because that imposes an ordering and spacing the data does not have, so I need a better approach for using the categorical fields. I have heard that one can use binary indicator variables. Is this a feasible method? Advice would be much appreciated.

Graham asked Nov 29 '12 12:11




2 Answers

You need a distance function that works for your data. Binary indicator variables solve this problem implicitly, with the benefit that you can keep using your (probably matrix-based) implementation on this kind of data. A much simpler way, appropriate for most distance-based methods, is to just use a modified distance function.

There are infinitely many ways to combine distances, and you need to experiment to find which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied, although it may make sense to move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
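As a minimal sketch of one such combination (not the only one, and not necessarily the best for your data): Euclidean distance on range-normalized numeric fields plus a weighted count of mismatching categorical fields. The function names, the `cat_weight` parameter, and the sample records are all hypothetical, and the sketch assumes every numeric column has a non-zero range.

```python
import math
from collections import Counter

def mixed_distance(a, b, numeric_idx, categorical_idx, ranges, cat_weight=1.0):
    """Range-normalised Euclidean distance on the numeric fields plus a
    weighted count of categorical fields whose values differ."""
    numeric_part = math.sqrt(
        sum(((a[i] - b[i]) / ranges[i]) ** 2 for i in numeric_idx)
    )
    mismatches = sum(1 for i in categorical_idx if a[i] != b[i])
    return numeric_part + cat_weight * mismatches

def knn_predict(query, rows, labels, k, **distance_args):
    """Majority vote among the k rows closest to `query`."""
    order = sorted(
        range(len(rows)),
        key=lambda j: mixed_distance(query, rows[j], **distance_args),
    )
    votes = Counter(labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (salary, age, bank, account type)
rows = [
    (30000, 25, "bank_a", "cheque"),
    (32000, 27, "bank_a", "savings"),
    (90000, 50, "bank_b", "cheque"),
    (95000, 55, "bank_b", "savings"),
]
labels = ["low", "low", "high", "high"]
numeric_idx, categorical_idx = (0, 1), (2, 3)

# The normalisation lives inside the distance: divide by each column's range.
ranges = {i: max(r[i] for r in rows) - min(r[i] for r in rows) for i in numeric_idx}

query = (31000, 26, "bank_a", "cheque")
prediction = knn_predict(
    query, rows, labels, k=3,
    numeric_idx=numeric_idx, categorical_idx=categorical_idx, ranges=ranges,
)
print(prediction)  # low
```

`cat_weight` is the "scaled appropriately" knob: it controls how much one categorical mismatch counts relative to the normalized numeric distance, and it is something to tune by experiment. For regression rather than classification, the majority vote would be replaced by an average of the neighbours' target values.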

In most real applications of distance-based algorithms, optimizing the domain-specific distance function is the most difficult part. You can see this as part of preprocessing: defining similarity.

Euclidean distance is far from the only option. Various set-theoretic measures may be much more appropriate in your case: for example, the Tanimoto coefficient, Jaccard similarity, Dice's coefficient, and so on. Cosine similarity might be an option, too.
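To illustrate two of these measures, here is a small sketch that treats each record as a set of "field=value" tokens. That representation and the sample records are assumptions made for the example. On such sets (equivalently, on binary indicator vectors) the Tanimoto coefficient coincides with Jaccard similarity.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dice_coefficient(a, b):
    """2 * |A ∩ B| / (|A| + |B|); two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical records encoded as sets of "field=value" tokens.
x = {"bank=bank_a", "account=cheque", "sex=female"}
y = {"bank=bank_a", "account=savings", "sex=female"}

print(jaccard_similarity(x, y))      # 0.5  (2 shared tokens out of 4 distinct)
print(1 - jaccard_similarity(x, y))  # 0.5  (Jaccard distance, usable in KNN)
```

A similarity s in [0, 1] can be turned into a distance as 1 - s before it is combined with a distance on the numeric fields.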

There are whole conferences dedicated to similarity search. Nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012

Has QUIT--Anony-Mousse answered Sep 25 '22 03:09


The most straightforward way to convert categorical data into numeric form is to use indicator vectors. See the reference I posted in my previous comment.
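A minimal sketch of indicator (one-hot) vectors in plain Python, with hypothetical helper names and sample values: each category gets its own 0/1 component, so no artificial ordering is introduced.

```python
import math

def fit_categories(values):
    """Fix a stable ordering of the distinct category values."""
    return sorted(set(values))

def indicator_vector(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical column of bank names.
banks = ["bank_b", "bank_a", "bank_c", "bank_a"]
categories = fit_categories(banks)

print(categories)                              # ['bank_a', 'bank_b', 'bank_c']
print(indicator_vector("bank_b", categories))  # [0, 1, 0]

# Any two distinct banks end up the same Euclidean distance apart.
d_ab = math.dist(indicator_vector("bank_a", categories),
                 indicator_vector("bank_b", categories))
d_ac = math.dist(indicator_vector("bank_a", categories),
                 indicator_vector("bank_c", categories))
```

Here `d_ab` and `d_ac` both equal the square root of 2, whereas integer keys (bank_a = 1, bank_b = 2, bank_c = 3) would place bank_a twice as far from bank_c as from bank_b.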

Shai answered Sep 25 '22 03:09