I'm currently experimenting with an ML task that involves supervised training of a classification model. So far I have ~5M training examples and another ~5M for cross-validation. Each example currently has 46 features, but I may want to generate 10 more in the near future, so any solution should leave some headroom.
My problem is the following: what tool do I use to tackle this? I'd like to use random forests or an SVM, but I'm afraid the latter may be too slow in my case. I considered Mahout but turned away, as it appears to require a fair amount of configuration and messing with command-line scripts. I'd rather code directly against some (well-documented!) library or define my model with a GUI.
I should also specify that I'm looking for something that runs on Windows (without the likes of Cygwin), and solutions that play well with .NET are much appreciated.
You can imagine that, when the time comes, the code will run on a Cluster Compute Eight Extra Large instance on Amazon EC2, so anything that makes good use of RAM and multi-core CPUs is welcome.
Last but not least, I should specify that my dataset is dense: there are no missing values, and every column has a value in every vector.
I routinely run datasets with similar row/feature counts in R on EC2 (the 16-core / 60 GB instance type you're referring to is particularly useful if you're using a method that can take advantage of multiple CPUs, such as the caret package). As you've mentioned, though, not all learning methods (such as SVM) will perform well on such a dataset.
You may want to consider using a 10% sample or so for quick prototyping / performance benchmarking before switching to running on the full dataset.
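To make the prototyping idea concrete, here is a minimal sketch in Python using scikit-learn as a stand-in library (the question mentions random forests; the synthetic data, feature count, and all parameter choices below are illustrative assumptions, not a recommendation of specific values). The same subsample-first workflow applies in R/caret or any other toolkit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real ~5M x 46 dense matrix (much smaller here).
X = rng.normal(size=(10_000, 46))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Draw a ~10% sample for quick prototyping / benchmarking.
sample = rng.choice(len(X), size=len(X) // 10, replace=False)
X_small, y_small = X[sample], y[sample]

# n_jobs=-1 uses all available cores, which matters on a 16-core EC2 box.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_small, y_small)

# Sanity-check on the full (held-out) data before committing to a full run.
acc = accuracy_score(y, clf.predict(X))
print(f"accuracy on full set: {acc:.3f}")
```

Once the sampled run looks reasonable (accuracy, runtime, memory), scale the same pipeline up to the full dataset.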
If you want extremely high performance, then Vowpal Wabbit is a better fit (but it only supports generalized linear learners, so no GBM or random forests). Besides, VW is not very Windows-friendly.
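If you do try VW, note that it consumes a plain-text input format of the form `<label> | <name>:<value> ...`. A minimal sketch of converting one dense example (the feature names `f0`, `f1`, ... are made up for illustration):

```python
def to_vw_line(label, features):
    """Convert one dense example to a basic Vowpal Wabbit text line:
    '<label> | f0:<v0> f1:<v1> ...'. Feature names here are invented;
    for VW's logistic loss the label should be -1 or 1."""
    feats = " ".join(f"f{i}:{v:g}" for i, v in enumerate(features))
    return f"{label} | {feats}"

print(to_vw_line(1, [0.5, 2.0, -1.25]))
# -> 1 | f0:0.5 f1:2 f2:-1.25
```

Streaming your 5M rows through a converter like this avoids ever holding the full dataset in memory, which is part of why VW scales so well.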