 

Choosing a classification algorithm for a mix of nominal and numeric data?

I have a dataset of about 100,000 records on customers' buying patterns. The dataset contains:

  • Age (a continuous value from 2 to 120), though I also plan to categorize it into age ranges.
  • Gender (either 0 or 1)
  • Address (one of only six types, which I could also represent as numbers from 1 to 6)
  • Preferred shop (one of only 7 shops), which is my class label.

So my problem is to predict a customer's preferred shop based on their age, gender, and location. I have tried naive Bayes and decision trees, but their classification accuracy is a bit low.

I am also considering logistic regression, but I am not sure how to handle the discrete values like gender and address. I have also thought about SVM with some kernel tricks, but I haven't tried it yet.

So which machine learning algorithm do you suggest for better accuracy with these features?

asked Dec 15 '22 by Tad

2 Answers

The issue is that you're representing nominal variables on a continuous scale, which imposes a (spurious) ordinal relationship between categories when you use machine learning methods. For example, if you code address as one of six possible integers, then address 1 is closer to address 2 than it is to addresses 3, 4, 5, and 6. This is going to cause problems when you try to learn anything.

Instead, translate your six-value categorical variable into six binary variables, one for each categorical value. Your original feature then gives rise to six features, of which exactly one will ever be on. Also, keep age as an integer value, since you lose information by making it categorical.
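A minimal sketch of that one-hot encoding in plain Python, assuming the six address types are coded 1 to 6 as described in the question:

```python
def one_hot(value, categories):
    """Map a categorical value to a list of binary indicators,
    one per category, with exactly one indicator set to 1."""
    return [1 if value == c else 0 for c in categories]

addresses = [1, 2, 3, 4, 5, 6]

# A record with age 42 and address 3: age stays numeric,
# address expands into six binary features.
features = [42] + one_hot(3, addresses)
print(features)  # [42, 0, 0, 1, 0, 0, 0]
```

Libraries such as pandas (`get_dummies`) or scikit-learn (`OneHotEncoder`) do the same transformation for whole columns at once.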

As for approaches, it's unlikely to make much of a difference (at least initially). Go with whichever is easier for you to implement. However, make sure you run some sort of cross-validated parameter selection on a dev set before running on your test set, as all algorithms have parameters that can dramatically affect learning accuracy.

answered Jan 12 '23 by Ben Allison


You really need to look at the data and determine whether there is enough variance between your labels given the features you currently have. Because you have so few features but a lot of data, something such as kNN could work well.
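A kNN baseline along those lines might look like the following scikit-learn sketch; the data is made up, and one-hot address encoding is assumed:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.integers(2, 120, size=300),    # age
    rng.integers(0, 2, size=300),      # gender
    rng.integers(0, 2, size=(300, 6))  # one-hot address bits (toy values)
]).astype(float)
# Scale age down so it doesn't dominate the Euclidean distance.
X[:, 0] /= X[:, 0].max()
y = rng.integers(0, 7, size=300)       # 7 preference shops

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
preds = knn.predict(X[:3])
```

Scaling matters here: without it, the age column (up to 120) would swamp the binary features in the distance computation.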

You could also adapt collaborative filtering to your problem, since it likewise works off similarity between records.

answered Jan 12 '23 by Steve