Machine learning task: what tool to use?

I'm currently experimenting with an ML task that involves supervised training of a classification model. So far I have ~5M training examples and ~5M examples for cross-validation. Each example currently has 46 features, but I might generate 10 more in the near future, so any solution should leave some headroom.

My problem is the following: what tool do I use to tackle this problem? I'd like to use random forests or SVM, but I'm afraid the latter might be too slow in my case. I've considered Mahout, but turned away from it because it appears to require a fair amount of configuration plus fiddling with command-line scripts. I'd rather code directly against some (well documented!) library or define my model with a GUI.

I should also specify that I'm looking for something that will run on Windows (without things such as cygwin), and that solutions that play well with .NET are much appreciated.

As you can imagine, when the time comes, the code will run on a Cluster Compute Eight Extra Large instance on Amazon EC2, so anything that makes good use of RAM and multi-core CPUs is welcome.

Last but not least, my dataset is dense: there are no missing values, and every column has a value for each vector.

asked Dec 24 '11 10:12

em70

1 Answer

I routinely run datasets with similar row and feature counts in R on EC2. The 16-core / 60 GB instance type you are referring to is particularly useful if you use a method that can take advantage of multiple CPUs, such as the caret package. As you've mentioned, though, not all learning methods (SVM, for example) will perform well on a dataset of this size.
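To get a feel for whether data of this size fits in RAM, here is a rough back-of-the-envelope sketch in Python. It assumes every value is stored as an 8-byte float, which the question does not state; the helper name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(rows: int, features: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size, in GiB, of a dense numeric matrix."""
    return rows * features * bytes_per_value / 2**30


# Figures from the question: ~5M training rows plus ~5M validation rows,
# 46 features today and possibly 56 later.
# Assumption: 8 bytes per value (double precision).
ROWS = 10_000_000
for n_features in (46, 56):
    print(f"{n_features} features: {dense_matrix_gib(ROWS, n_features):.2f} GiB")
```

Under that assumption the raw data comes to roughly 3.4 GiB with 46 features and 4.2 GiB with 56. Treat this as a lower bound: many learners copy the data or build large intermediate structures while training.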

You may want to consider using a 10% sample or so for quick prototyping / performance benchmarking before switching to running on the full dataset.
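One way to draw such a sample, sketched in plain Python. Sampling within each class (stratifying) is my own suggestion for keeping the class balance intact, and `stratified_sample` is a hypothetical helper, not part of any library.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction=0.10, seed=0):
    """Return sorted row indices for a sample that keeps class proportions."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class, even for very rare classes.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels for illustration only: 900 negatives and 100 positives.
labels = [0] * 900 + [1] * 100
sample = stratified_sample(labels)
print(len(sample), sum(labels[i] for i in sample))
```

Timing a candidate learner on the sampled rows first gives a cheap estimate of how it will scale before you commit to the full 5M examples.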

If you want extremely high performance, Vowpal Wabbit is a better fit, but it only supports generalized linear learners, so no gbm or random forest. Besides, VW is not very Windows-friendly.
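If you do try VW, the data has to be converted to its plain-text input format first. A minimal sketch of that conversion, assuming a binary problem with 0/1 labels (the question does not say how many classes there are) and using made-up feature names `f0`, `f1`, and so on:

```python
def to_vw_line(label, features):
    """Format one dense example as a VW text line: '<label> | f0:<v0> f1:<v1> ...'."""
    # Assumption: labels are 0/1; VW's logistic loss expects -1/+1.
    vw_label = 1 if label == 1 else -1
    pairs = " ".join(f"f{i}:{value:g}" for i, value in enumerate(features))
    return f"{vw_label} | {pairs}"


# Toy rows for illustration only.
rows = [(1, [0.5, 2.0, 1.25]), (0, [3.0, 0.25, 7.5])]
for label, features in rows:
    print(to_vw_line(label, features))
```

Each output line holds one example: the label, a `|` separator, then `name:value` pairs.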

answered Nov 14 '22 16:11

Yevgeny

Tags: machine-learning, amazon-ec2, classification, cloud