What learning algorithm(s) should I consider to train a log-linear regression model?

I need to train a regression model over a large set of training examples, with the potential to incorporate arbitrary features. What learning algorithms should I consider and why?

A quick summary of the problem:

  • Approximately 5 million training examples
  • Adding training examples at a rate of 2-4 million per year
  • Training examples currently contain 10 features each
  • Approximately 400k populated features (out of a much larger total feature space)
  • Additional features added over time
  • Retraining or adapting the model (at least) daily to incorporate new examples
  • Optimization criterion: minimum squared percentage error (see the sketch after this list)
  • Output: a single real-valued number
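
To be concrete about the optimization criterion, here is a minimal sketch of how I compute the error. The normalization by the true value is my own convention and assumes the target is never zero:

    // Mean squared percentage error over a batch of predictions.
    public final class SquaredPercentageError {

        public static double meanSquaredPercentageError(double[] predicted, double[] actual) {
            double sum = 0.0;
            for (int i = 0; i < predicted.length; i++) {
                // Relative error; assumes actual[i] != 0.
                double relativeError = (predicted[i] - actual[i]) / actual[i];
                sum += relativeError * relativeError;
            }
            return sum / predicted.length;
        }
    }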

I have some experience training log-linear models on similarly sized classification problems (using SVMs, Averaged and Voted Perceptrons, etc.). The ability to add arbitrary features is important, but in this instance, training time is valuable as well.

For instance, my one experiment thus far with SVMLight took several weeks to converge on a subset of this data. We could parallelize across a multicore machine or (possibly) a cluster, but we need to train models in minutes. Online training would be even better.

I trained an Averaged Perceptron model successfully (and quickly). However, to my knowledge, the AP isn't normally applied to regression. Does the AP offer any convergence guarantees for a regression model? Is there another formal reason it shouldn't be applicable? Or is that a reasonable match for my requirements?
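
For concreteness, here is roughly the kind of online regressor I have in mind: plain stochastic gradient descent on squared error over sparse features, with perceptron-style weight averaging done lazily per feature. This is my own sketch rather than a standard Averaged Perceptron formulation, so treat it as illustrative only:

    import java.util.HashMap;
    import java.util.Map;

    // Online linear regressor over sparse features: SGD on squared error,
    // with lazily maintained running averages of the weights.
    public class AveragedOnlineRegressor {

        private static final class Entry {
            double weight;    // current weight for this feature
            double sum;       // sum of this weight over past update steps
            long lastStep;    // step at which 'sum' was last brought up to date
        }

        private final Map<Integer, Entry> entries = new HashMap<>();
        private final double learningRate;
        private long step = 0;

        public AveragedOnlineRegressor(double learningRate) {
            this.learningRate = learningRate;
        }

        private double rawPrediction(int[] indices, double[] values) {
            double dot = 0.0;
            for (int i = 0; i < indices.length; i++) {
                Entry e = entries.get(indices[i]);
                if (e != null) dot += e.weight * values[i];
            }
            return dot;
        }

        // One online update on a sparse example (parallel index/value arrays, real-valued target).
        public void update(int[] indices, double[] values, double target) {
            double error = rawPrediction(indices, values) - target;
            step++;
            for (int i = 0; i < indices.length; i++) {
                Entry e = entries.computeIfAbsent(indices[i], k -> new Entry());
                // Catch the running sum up before changing the weight (lazy averaging).
                e.sum += e.weight * (step - e.lastStep);
                e.lastStep = step;
                e.weight -= learningRate * error * values[i];
            }
        }

        // Prediction using the averaged weights, in the spirit of the averaged perceptron.
        public double predictAveraged(int[] indices, double[] values) {
            if (step == 0) return 0.0;
            double dot = 0.0;
            for (int i = 0; i < indices.length; i++) {
                Entry e = entries.get(indices[i]);
                if (e == null) continue;
                double averagedWeight = (e.sum + e.weight * (step - e.lastStep + 1)) / step;
                dot += averagedWeight * values[i];
            }
            return dot;
        }
    }

The averaging in predictAveraged plays the role of the averaged perceptron's final averaged weight vector; the SGD update itself is just squared-error regression.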

What other options should I research? An SVM would probably offer superior accuracy, but quadratic training time isn't acceptable. If linear-time SVM algorithms are accessible, that could work well.

Potential pluses:

  • Online training
  • Open-source implementation available (ideally in Java). We can roll our own implementation if necessary, but I'll avoid that if possible.

Thanks for your input.

asked by AaronD, Apr 24 '12

1 Answer

This is the classic problem with large-scale SVMs. An SVM model needs to be retrained when new features are added, and also when new data arrives unless you are using an online SVM. Some options:

Practical Options (off the shelf):

LIBLINEAR - If a linear SVM fits your problem, there are algorithms that exploit the linear kernel to give better-than-quadratic training time. Check out LIBLINEAR, which is from the same research group as LIBSVM. They just added regression support in version 1.91, released yesterday. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
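
Since you mentioned Java: there is a Java port (liblinear-java, package de.bwaldvogel.liblinear). Assuming your version of the port exposes the new SVR solver, training a regression model would look roughly like the sketch below; the class and enum names come from that port and may differ between versions:

    import de.bwaldvogel.liblinear.*;

    public class LiblinearSvrSketch {
        public static void main(String[] args) {
            // Two toy examples with sparse features (1-based index:value pairs).
            Problem problem = new Problem();
            problem.l = 2;                 // number of training examples
            problem.n = 3;                 // number of features
            problem.x = new Feature[][] {
                { new FeatureNode(1, 0.5), new FeatureNode(3, 1.0) },
                { new FeatureNode(2, 2.0) }
            };
            problem.y = new double[] { 1.2, 3.4 };   // real-valued targets

            // L2-regularized L2-loss support vector regression, C = 1.0, stopping tolerance 0.01.
            Parameter param = new Parameter(SolverType.L2R_L2LOSS_SVR, 1.0, 0.01);
            Model model = Linear.train(problem, param);

            double prediction = Linear.predict(model,
                    new Feature[] { new FeatureNode(1, 0.5), new FeatureNode(3, 1.0) });
            System.out.println("prediction = " + prediction);
        }
    }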

Oracle ODM - Oracle has SVM available in their ODM package. They take a practical approach, basically providing a 'good enough' SVM without paying the computational cost of finding a truly optimal solution. They use some sampling and model-selection techniques - you can find info about that here: http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/overview/support-vector-machines-paper-1205-129825.pdf

SHOGUN - The SHOGUN Machine Learning Toolbox is designed for large-scale learning and interfaces with a number of SVM implementations as well as other methods. I've never used it, but it might be worth a look: http://www.shogun-toolbox.org

Kernel-machines.org has a list of software packages: http://www.kernel-machines.org/software

Other SVM research

If you are looking to roll your own, there are a number of techniques for scaling SVMs to large datasets that have been published in research papers, but the code is not necessarily available, usable, or maintained the way the examples above are. They claim good results, but each has its own set of drawbacks. Many involve some level of data selection. For example, several research papers use linear-time clustering algorithms to cluster the data and then train successive SVM models on the clusters, building the model without using all the data. Core Vector Machines claim linear training time, but there is some criticism of whether their accuracy is as high as claimed. Numerous papers use different heuristic algorithms to select the most likely support-vector candidates. Many of these are for classification but could probably be adapted to regression. If you would like more info on some of this research, I can add some references.

Tools for exploring algorithms

You are probably already familiar with these, but I figured I'd throw them in here just in case:

There are other algorithms with good runtime on large datasets, but whether they will work well is hard to tell; it depends on the makeup of your data. Since run time is important, I would start with the simpler models and work up to the more complex ones. ANNs, decision-tree regression, Bayesian methods, locally weighted linear regression, or a hybrid approach such as model trees (a decision tree whose leaf nodes are linear models) can all be trained more quickly than an SVM on large data sets and may produce good results.

WEKA - Weka is a good tool for exploring your options. I would use WEKA to try subsets of your data with different algorithms. The source code is open and in Java, so if you choose something, you could adapt it to your needs. http://www.cs.waikato.ac.nz/ml/weka/
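
As a rough sketch of that workflow, trying a model tree (M5P) on a sample of your data with WEKA's Java API might look like the following. The class names are from WEKA 3.x and the input file name is hypothetical:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class M5PExperiment {
        public static void main(String[] args) throws Exception {
            // Load a (sub)sample of the training data; last attribute is the numeric target.
            DataSource source = new DataSource("sample.arff");   // hypothetical file name
            Instances data = source.getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // M5P builds a model tree: a decision tree with linear models at the leaves.
            M5P modelTree = new M5P();

            // 10-fold cross-validation for a quick read on accuracy.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(modelTree, data, 10, new Random(1));
            System.out.println("RMSE: " + eval.rootMeanSquaredError());
            System.out.println("Relative absolute error: " + eval.relativeAbsoluteError() + " %");
        }
    }

Swapping M5P for weka.classifiers.functions.LinearRegression (or any other regressor) lets you compare candidates on the same sample before committing to an implementation.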

R - The R Programming Language also implements many algorithms and is similar to programming in Matlab. http://www.r-project.org/

I wouldn't recommend using WEKA or R on a large-scale data set, but they are useful tools for narrowing down which algorithms may work well for you.

answered by karenu, Oct 11 '22