
AWS SageMaker on GPU

I am trying to train a neural network (TensorFlow) on AWS, and I have some AWS credits. From my understanding, AWS SageMaker is the best fit for the job. I managed to open the JupyterLab console on SageMaker and tried to find a GPU kernel, since I know a GPU is best for training neural networks. However, I could not find such a kernel.

Would anyone be able to help in this regard?

Thanks & Best Regards

Michael

Asked Mar 26 '20 by Schroter Michael

People also ask

Does SageMaker have GPU?

Yes. You can use Amazon SageMaker to train deep learning models on Amazon EC2 P3 instances, among the fastest GPU instances in the cloud.

Does AWS use GPU?

Yes. Amazon EC2 P4d instances are powered by the latest NVIDIA A100 Tensor Core GPUs and deliver industry-leading high throughput and low-latency networking.

Does SageMaker use CUDA?

Amazon SageMaker's pre-built deep learning framework containers support TensorFlow 1.5 and Apache MXNet 1.0, both of which take advantage of CUDA 9 optimizations for faster performance on SageMaker GPU instances.

What does SageMaker run on?

Amazon SageMaker creates a fully managed ML instance in Amazon Elastic Compute Cloud (EC2) and runs the open-source Jupyter Notebook web application, which lets developers write and share live code.


1 Answer

You train models on a GPU in the SageMaker ecosystem via two different components:

  1. You can instantiate a GPU-powered SageMaker Notebook Instance, for example ml.p2.xlarge (NVIDIA K80) or ml.p3.2xlarge (NVIDIA V100). This is convenient for interactive development: the GPU sits right under your notebook, so you can run code on it interactively and monitor it via nvidia-smi in a terminal tab (or check GPU visibility from the kernel, as in the first sketch after this list), which makes for a great development experience. However, when you develop directly on a GPU-powered machine, there are stretches of time when you do not use the GPU at all, for example while writing code or browsing documentation. During all that time you pay for a GPU that sits idle, so this may not be the most cost-effective option for your use case.

  2. Another option is to use a SageMaker Training Job running on a GPU instance. This is the preferred option for training, because training metadata (data and model paths, hyperparameters, cluster specification, etc.) is persisted in the SageMaker metadata store, logs and metrics are stored in CloudWatch, and the instance shuts itself down automatically at the end of training. Developing on a small CPU instance and launching training tasks through the SageMaker Training API helps you make the most of your budget, while retaining the metadata and artifacts of all your experiments (see the second sketch after this list). The SageMaker documentation includes a well-documented TensorFlow example.
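
A minimal sketch for option 1, assuming TensorFlow 2.x is installed on the notebook instance (the CPU-only instance type named in the comment is just an example):

    # Check whether the notebook kernel can see the instance's GPU.
    # On a CPU-only instance type (e.g. ml.t3.medium) the list is empty;
    # on ml.p2 / ml.p3 instances it lists the attached NVIDIA devices.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print("GPUs visible to TensorFlow:", gpus)

You can also run nvidia-smi in a terminal tab (or as !nvidia-smi in a notebook cell) to watch GPU utilization while your code runs.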
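
A minimal sketch for option 2 using the SageMaker Python SDK's TensorFlow estimator; the entry-point script, S3 path, hyperparameters, and framework/Python versions below are hypothetical placeholders you would replace with your own:

    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    # IAM role attached to the notebook instance (or pass a role ARN explicitly).
    role = sagemaker.get_execution_role()

    estimator = TensorFlow(
        entry_point="train.py",          # hypothetical training script
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",   # single NVIDIA V100 GPU
        framework_version="2.4.1",
        py_version="py37",
        hyperparameters={"epochs": 10, "batch-size": 64},
    )

    # The S3 input path is a placeholder. The GPU instance is billed only for
    # the duration of the job and shuts itself down when the script exits.
    estimator.fit({"training": "s3://my-bucket/train-data/"})

The job's hyperparameters, input/output paths, logs, and metrics then appear in the SageMaker console and CloudWatch, which is how the metadata mentioned above gets retained.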

Answered Oct 24 '22 by Olivier Cruchant