I am trying to train a neural network (Tensorflow) on AWS. I have some AWS credits. From my understanding AWS SageMaker is the one best for the job. I managed to load the Jupyter Lab console on SageMaker and tried to find a GPU kernel since, I know it is the best for training neural networks. However, I could not find such kernel.
Would anyone be able to help in this regard.
Thanks & Best Regards
Michael
Train deep learning models with the fastest GPU instances in the cloud. You can use Amazon SageMaker to easily train deep learning models on Amazon EC2 P3 instances, the fastest GPU instances in the cloud.
AWS and NVIDIA ServicesP4d instances are powered by the latest NVIDIA A100 Tensor Core GPUs and deliver industry-leading high throughput and low latency networking.
Amazon SageMaker pre-built deep learning framework containers now support TensorFlow 1.5 and Apache MXNet 1.0, both of which take advantage of CUDA 9 optimizations for faster performance on SageMaker ml.
Amazon SageMaker creates a fully managed ML instance in Amazon Elastic Compute Cloud (EC2). It supports the open source Jupyter Notebook web application that enables developers to share live code. SageMaker runs Jupyter computational processing notebooks.
You train models on GPU in the SageMaker ecosystem via 2 different components:
You can instantiate a GPU-powered SageMaker Notebook Instance, for example p2.xlarge
(NVIDIA K80) or p3.2xlarge
(NVIDIA V100). This is convenient for interactive development - you have the GPU right under your notebook and can run code on the GPU interactively and monitor the GPU via nvidia-smi
in a terminal tab - a great development experience. However when you develop directly from a GPU-powered machine, there are times when you may not use the GPU. For example when you write code or browse some documentation. All that time you pay for a GPU that sits idle. In that regard, it may not be the most cost-effective option for your use-case.
Another option is to use a SageMaker Training Job running on a GPU instance. This is a preferred option for training, because training metadata (data and model path, hyperparameters, cluster specification, etc) is persisted in the SageMaker metadata store, logs and metrics stored in Cloudwatch and the instance automatically shuts down itself at the end of training. Developing on a small CPU instance and launching training tasks using SageMaker Training API will help you make the most of your budget, while helping you retain metadata and artifacts of all your experiments. You can see here a well documented TensorFlow example
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With