
Spark - Which instance type is preferred for AWS EMR cluster? [closed]

I am running some machine learning algorithms on an EMR Spark cluster. I am curious which kind of instance to use to get the best cost/performance trade-off.

At the same price level, I can choose among:

            vCPU   ECU   Memory (GiB)
m3.xlarge     4     13    15
c4.xlarge     4     16     7.5
r3.xlarge     4     13    30.5

Which kind of instance should be used in an EMR Spark cluster?

asked May 25 '15 by shihpeng

People also ask

What is an EMR instance?

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

Which AWS service should you use to run an Apache Spark application?

Amazon EMR is the best place to run Apache Spark. You can quickly and easily create managed Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API.

What are normalized instance hours in EMR?

Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small usage = 1 hour of normalized compute time. You can view the documentation to see a list of the different sizes within an instance family and the corresponding normalization factor per hour.

Is EMR single AZ?

Your cluster will still be deployed in a single Availability Zone; however, selecting multiple Availability Zones allows Amazon EMR to look across all of the selected Availability Zones and deploy your cluster in the one with the most EC2 Spot capacity available to run it.


2 Answers

Generally speaking, it depends on your use case, needs, etc., but I can suggest a minimum configuration based on the information you have shared.

You seem to be trying to train an ALS factorization or an SVD on matrices of roughly 2 to 4 GB of data, so that's actually not that much data.

You'll need at least 1 master and 2 worker nodes to set up and configure a small distributed cluster. The master won't be doing any computation whatsoever, so it won't need many resources, but it will of course be handling task scheduling, etc.

You can add slaves (instances) according to your needs.

  • 1 x master : m3.xlarge (updated below: m5.xlarge) - vCPU : 4, RAM : 16 GB, with EBS storage.
  • 2 x slaves : c3.4xlarge (updated below: c5.xlarge) - vCPU : 16, RAM : 32 GB, with EBS storage.
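For concreteness, here is a minimal sketch (mine, not part of the original answer) of how that master-plus-two-workers layout could be launched with boto3's EMR client. The release label, log bucket, key pair, and region are placeholder assumptions you would replace with your own values.

    # Hypothetical sketch: launch the 1 master + 2 worker layout above with boto3.
    # Release label, key pair, log bucket, and region are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-ml-cluster",
        ReleaseLabel="emr-5.36.0",              # assumed release; pick a current one
        Applications=[{"Name": "Spark"}],
        LogUri="s3://my-bucket/emr-logs/",      # placeholder bucket
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Workers", "InstanceRole": "CORE",
                 "InstanceType": "c5.xlarge", "InstanceCount": 2},
            ],
            "Ec2KeyName": "my-key-pair",        # placeholder key pair
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])  # the new cluster's identifier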

EDIT : As mentioned in the comments, 5th generation instances are now available for each of the instance types mentioned in this thread: R5, M5, and C5. In general, latest-generation instance types are cheaper and more performant than their older counterparts.

C3, C4, and C5 are compute optimized instances featuring high performance processors, with the lowest price per unit of compute performance in EC2 compared to R3, R4, or R5, whose recommended use cases are distributed memory caches and in-memory analytics. But C5 will do the job for you at a lower price.

Performance Optimizations :

  • Amazon EMR charges in hourly increments. This means that once you run a cluster, you are paying for the entire hour. That's important to remember, because if you are paying for a full hour of an Amazon EMR cluster, improving your data processing time by a matter of minutes may not be worth your time and effort.

  • Don't forget that adding more nodes to increase performance is cheaper than spending time optimizing your cluster.

Reference : Amazon EMR Best Practices - Parviz Deyhim.

EDIT : You might also consider enabling Ganglia to monitor your cluster resources: CPU, RAM, network I/O. This would also help you tune your EMR cluster. In practice, there is no configuration to do; just follow the documentation to add it to your EMR cluster on creation.
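If you are creating the cluster with boto3 as in the earlier sketch, "adding it on creation" would simply mean listing Ganglia as an extra application. The snippet below is a small, assumption-labeled fragment of that same hypothetical call, not the library's only way of doing it.

    # Hypothetical: the Applications list from the earlier run_job_flow sketch,
    # extended so the cluster also ships with Ganglia monitoring dashboards.
    applications = [{"Name": "Spark"}, {"Name": "Ganglia"}]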

answered by eliasah


Generally speaking, the preferred instance depends on the job you are running (is it memory intensive? Is it CPU intensive? etc.). However, Spark is very memory intensive, and I wouldn't use machines with less than 30 GB of RAM for most jobs.

In your particular case (a 4 GB dataset), I am not sure why you'd want to use distributed computing to begin with; it will just make your job run slower. If you are sure you want Spark, run it in local mode with X threads (depending on how many cores you have).
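As an illustration of that suggestion (my sketch, not part of the original answer), a local-mode Spark session with a fixed thread count looks roughly like this. The thread count, file path, and column names are placeholder assumptions.

    # Hypothetical sketch: run Spark locally with 4 threads instead of on a cluster.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = (
        SparkSession.builder
        .master("local[4]")      # 4 threads; use "local[*]" for all available cores
        .appName("local-als")
        .getOrCreate()
    )

    # Placeholder path: a small ratings file with integer userId/itemId and a numeric rating.
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # Train an ALS factorization on the local data, as mentioned in the question.
    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=10)
    model = als.fit(ratings)

    spark.stop()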

answered by Arnon Rotem-Gal-Oz