
Java, Weka: How to predict numeric attribute?

I was trying to use the NaiveBayesUpdateable classifier from Weka. My data contains both nominal and numeric attributes:

  @relation cars
  @attribute country {FR, UK, ...}
  @attribute city {London, Paris, ...}
  @attribute car_make {Toyota, BMW, ...}
  @attribute price numeric   %% car price 
  @attribute sales numeric   %% number of cars sold

I need to predict the number of sales (numeric!) based on other attributes.

I understand that I cannot use a numeric class attribute for Bayes classification in Weka. One technique is to split the range of the numeric attribute into N intervals of length k and use a nominal attribute instead, where each interval index n is a class name, like this: @attribute class {1,2,3,...N}.
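A minimal sketch of the binning described above, in plain Java rather than through any Weka filter. The class name `SalesBinning`, the method `binIndex`, and the choice of N = 10 intervals of length k = 100,000 are assumptions made for illustration only. The point it shows is that N is chosen by the modeller (roughly range / k), so a 0 to 1 000 000 range does not force 1 000 000 classes.

```java
final class SalesBinning {

    /**
     * Maps a numeric value to a 1-based interval index in {1, ..., n},
     * using n equal-width intervals of length k that start at min.
     * Values outside the covered range are clamped to the first or last interval.
     */
    static int binIndex(double value, double min, double k, int n) {
        if (k <= 0 || n <= 0) {
            throw new IllegalArgumentException("k and n must be positive");
        }
        int idx = (int) Math.floor((value - min) / k) + 1;
        return Math.max(1, Math.min(n, idx));
    }

    public static void main(String[] args) {
        // Hypothetical setup: sales in [0, 1,000,000], N = 10 intervals of length k = 100,000.
        double k = 100_000;
        int n = 10;
        for (double sales : new double[] {0, 99_999, 100_000, 545_000, 1_000_000}) {
            System.out.println(sales + " -> class " + binIndex(sales, 0, k, n));
        }
    }
}
```

The price of this approach is resolution: every prediction is only as precise as the interval width k.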

Yet the numeric attribute that I need to predict ranges from 0 to 1 000 000. Creating 1 000 000 classes makes no sense at all. How can I predict a numeric attribute with Weka, or what algorithms should I look for in case Weka has no tools for this task?

Anton Ashanin asked Apr 25 '13 19:04



2 Answers

What you want to do is regression, not classification. The difference is exactly what you describe/want:

  • Classification has discrete classes/labels; any nominal attribute can be used as the class here.
  • Regression has continuous labels; "classes" would be the wrong term here.

Most regression-based techniques can be turned into a binary classification by defining a threshold: the class is determined by whether the predicted value is above or below that threshold.
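The thresholding step can be sketched in a few lines of plain Java. The class name `ThresholdClassifier`, the labels "high" and "low", and the threshold value are hypothetical; treating a value exactly equal to the threshold as "high" is also just a convention chosen for this sketch.

```java
final class ThresholdClassifier {

    /**
     * Turns a numeric prediction (e.g. the output of a regression model)
     * into one of two class labels by comparing it with a threshold.
     */
    static String classify(double predicted, double threshold) {
        // Convention assumed here: a value equal to the threshold counts as "high".
        return predicted >= threshold ? "high" : "low";
    }

    public static void main(String[] args) {
        double threshold = 500_000; // hypothetical cut-off for the number of cars sold
        System.out.println(classify(120_000, threshold));
        System.out.println(classify(730_000, threshold));
    }
}
```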

I don't know all of WEKA's classifiers that offer regression, but you can start by looking at these two:

  • MultilayerPerceptron: Basically a neural network.
  • LinearRegression: As the name says, linear regression.
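To make concrete what "predicting a numeric value" means, here is a hand-rolled least-squares fit with a single numeric input, in plain Java. This is not WEKA's LinearRegression (which works on whole datasets with many attributes); the class `SimpleLeastSquares`, its methods, and the toy numbers are made up purely for illustration.

```java
final class SimpleLeastSquares {

    /** Returns {intercept, slope} minimising the squared error of y ≈ intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("x must not be constant");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    /** Predicts a continuous value for a new input; no classes are involved. */
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        // Toy, made-up data: one numeric input and one numeric target.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] model = fit(x, y);
        System.out.println("intercept=" + model[0] + ", slope=" + model[1]);
        System.out.println("prediction for x=5: " + predict(model, 5));
    }
}
```

The output of `predict` is a number on a continuous scale, which is exactly what distinguishes regression from classification.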

You might have to use the NominalToBinary filter to convert your nominal attributes to numerical (binary) ones.
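The idea behind converting a nominal attribute to binary ones can be sketched as one 0/1 indicator per nominal value. The plain-Java sketch below only illustrates that idea; it does not call WEKA's NominalToBinary filter, and the class `OneHot`, the method `encode`, and the third car make in the list are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class OneHot {

    /**
     * Encodes one nominal value as a vector of 0/1 indicators,
     * one position per declared nominal value.
     */
    static double[] encode(List<String> declaredValues, String value) {
        int idx = declaredValues.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown nominal value: " + value);
        }
        double[] out = new double[declaredValues.size()]; // all zeros initially
        out[idx] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        // "Audi" is a made-up third value, added only to make the example readable.
        List<String> carMakes = Arrays.asList("Toyota", "BMW", "Audi");
        System.out.println(Arrays.toString(encode(carMakes, "BMW")));
    }
}
```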

Sentry answered Sep 21 '22 07:09



These days (I believe since Weka 3.7), RandomForest would work just as you want. The features can be a mix of nominal and numeric attributes, and the prediction is allowed to be numeric as well.

The drawback (I would imagine in your case) is that it is not an Updateable class, whereas NaiveBayesUpdateable works well with large amounts of data that may not fit in memory all at once.
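As a rough illustration of what "updateable" buys you, the plain-Java sketch below keeps a running mean and variance that are updated one value at a time (Welford's method), so the full data set never has to sit in memory. It is not Weka code and makes no claim about how NaiveBayesUpdateable is implemented internally; the class `RunningStats` and its methods are hypothetical.

```java
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean

    /** Folds one new observation into the statistics in constant time and memory. */
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; defined as 0 while fewer than two values have been seen. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        // Values could just as well be read one by one from a file or a stream.
        for (double sales : new double[] {2, 4, 6}) {
            stats.update(sales);
        }
        System.out.println("n=" + stats.count() + ", mean=" + stats.mean() + ", variance=" + stats.variance());
    }
}
```

A batch learner, by contrast, needs the whole training set available when the model is built.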

demongolem answered Sep 22 '22 07:09
