 

Is there some kind of persistent local storage in AWS SageMaker model training?

I did some experimentation with AWS SageMaker, and the download time of large datasets from S3 is very problematic, especially when the model is still in development and you want some kind of initial feedback relatively quickly.

Is there some kind of local storage or other way to speed things up?

EDIT: I am referring to the batch training service, which allows you to submit a job as a Docker container.

While this service is intended for already-validated jobs that typically run for a long time (which makes the download time less significant), there is still a need for quick feedback:

  1. There's no other way to do "integration" testing of your job against the SageMaker infrastructure (configuration files, data files, etc.).

  2. When experimenting with different variations of the model, it's important to be able to get initial feedback relatively quickly.

asked Jan 18 '18 by Ophir Yoktan




1 Answer

SageMaker has a few distinct services in it, and each is optimized for a specific use case. If you are talking about the development environment, you are probably using the notebook service. The notebook instance comes with a local EBS volume (5 GB) that you can use to copy some data into and run fast development iterations without pulling the data from S3 every time. The way to do it is by running wget or aws s3 cp from the notebook cells, or from the terminal that you can open from the directory list page.
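For example, a small dataset can be pulled onto the notebook's local volume with boto3. This is a minimal sketch; the bucket and key names are placeholders. Files saved under /home/ec2-user/SageMaker sit on the notebook instance's persistent EBS volume:

    import boto3

    # Hypothetical bucket and key; replace with your own.
    s3 = boto3.client("s3")
    s3.download_file(
        "my-bucket",                           # S3 bucket (placeholder)
        "datasets/train.csv",                  # object key (placeholder)
        "/home/ec2-user/SageMaker/train.csv",  # local path on the notebook's EBS volume
    )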

Nevertheless, it is not recommended to copy too much data into the notebook instance, as it will cause your training and experiments to take too long. Instead, you should utilize the second part of SageMaker: the training service. Once you have a good sense of the model that you want to train, based on quick iterations over small datasets on the notebook instance, you can point your model definition at larger datasets to be processed in parallel across a cluster of training instances. When you send a training job, you can also define how much local storage each training instance will use, but you will benefit most from the distributed mode of the training.
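As a rough sketch of that workflow using the SageMaker Python SDK (v2 parameter names; the image URI, role ARN, and S3 paths below are placeholders), a training job over a small cluster might look like this:

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        # Hypothetical ECR image and IAM role; replace with your own.
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=2,             # train in parallel across two instances
        instance_type="ml.m5.xlarge",
        volume_size=50,               # per-instance EBS volume, in GB
    )

    # Each channel maps an S3 prefix to a directory inside the container.
    estimator.fit({"train": "s3://my-bucket/datasets/train/"})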

When you want to optimize your training job, you have a few options for storage. First, you can define the size of the EBS volume that your model will train on, for each one of the cluster instances. You can specify it when you launch the training job (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html):

...
   "ResourceConfig": { 
      "InstanceCount": number,
      "InstanceType": "string",
      "VolumeKmsKeyId": "string",
      "VolumeSizeInGB": number
   },
...
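With the low-level boto3 API, the same setting is passed directly in ResourceConfig. Here is a minimal, hypothetical create_training_job call (the job name, image URI, role ARN, and S3 path are placeholders):

    import boto3

    sm = boto3.client("sagemaker")
    sm.create_training_job(
        TrainingJobName="my-experiment-001",  # placeholder
        AlgorithmSpecification={
            # Hypothetical ECR image; replace with your own.
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        ResourceConfig={
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 50,  # the EBS size discussed above
        },
        OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )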

Next, you need to decide what kind of models you want to train. If you are training your own models, you know how they get their data, in terms of format, compression, source, and other factors that can impact the performance of loading that data into the model input. If you prefer to use SageMaker's built-in algorithms, note that they are optimized to process the protobuf RecordIO format. See more information here: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
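If you go the built-in algorithm route, the SageMaker Python SDK can serialize numpy arrays into that RecordIO protobuf format. A minimal sketch, using toy random data (the bucket and key are placeholders):

    import io

    import boto3
    import numpy as np
    import sagemaker.amazon.common as smac

    # Toy data standing in for real features and labels.
    features = np.random.rand(1000, 10).astype("float32")
    labels = np.random.randint(0, 2, size=1000).astype("float32")

    # Serialize to protobuf RecordIO in memory, then upload to S3.
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, features, labels)
    buf.seek(0)
    boto3.client("s3").upload_fileobj(buf, "my-bucket", "train/data.rec")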

Another aspect that you can benefit from (or learn from, if you want to implement your own models in a more scalable and optimized way) is the TrainingInputMode (https://docs.aws.amazon.com/sagemaker/latest/dg/API_AlgorithmSpecification.html#SageMaker-Type-AlgorithmSpecification-TrainingInputMode):

Type: String

Valid Values: Pipe | File

Required: Yes

You can use File mode to read the data files from S3. However, you can also use Pipe mode, which opens up a lot of options for processing data in a streaming fashion. This doesn't mean only real-time data from streaming services such as AWS Kinesis or Kafka; you can also read your data from S3 and stream it to the model, completely avoiding the need to store the data locally on the training instances.
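To enable this, set TrainingInputMode to "Pipe" (or input_mode="Pipe" on an Estimator in the Python SDK). Inside the container, SageMaker then exposes each epoch of a channel as a FIFO under /opt/ml/input/data/. A minimal sketch of the consumer side, assuming a channel named "train" (the byte-counting loop stands in for real record parsing):

    # Pipe mode delivers the channel "train" for epoch 0 as a FIFO at
    # /opt/ml/input/data/train_0; read it sequentially like a file.
    total = 0
    with open("/opt/ml/input/data/train_0", "rb") as fifo:
        while True:
            chunk = fifo.read(1024 * 1024)  # stream in 1 MiB chunks
            if not chunk:                   # EOF: the epoch's data is fully streamed
                break
            total += len(chunk)             # real code would parse records here
    print(f"streamed {total} bytes for epoch 0")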

answered Sep 28 '22 by Guy