I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker python jupyter notebook for analysis.
I could use boto to grab the data from S3, but I'm wondering whether there is a more elegant method as part of the SageMaker framework to do this in my python code?
Thanks in advance for any advice.
AWS has created a great boto3 library, which allows for easy access to aws ecosystem of tools and products. SageMaker is a part of aws ecosystem of tools, so it allows easy access to S3. One of the key concepts in boto3 is a resource , an abstraction that provides access to AWS API and resources.
You can load data from AWS S3 into AWS SageMaker using Boto3 or AWSWranger. In this tutorial, you’ll learn how to load data from AWS S3 into SageMaker jupyter notebook. Note: This will only access the data from S3. The files will not be downloaded to the SageMaker Instance itself.
Show activity on this post. One way to solve this would be to save the CSV to the local storage on the SageMaker notebook instance, and then use the S3 API's via boto3 to upload the file as an s3 object. S3 docs for upload_file () available here.
Using the SageMaker and Boto3, upload the training and validation datasets to the default Amazon S3 bucket. The datasets in the S3 bucket will be used by a compute-optimized SageMaker instance on Amazon EC2 for training.
You can import either a single file or multiple files as a dataset. You can use the multifile import operation when you have a dataset that is partitioned into separate files. It takes all of the files from an Amazon S3 directory and imports them as a single dataset.
import boto3
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
In the simplest case you don't need boto3
, because you just read resources.
Then it's even simpler:
import pandas as pd
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
But as Prateek stated make sure to configure your SageMaker notebook instance to have access to s3. This is done at configuration step in Permissions > IAM role
If you have a look here it seems you can specify this in the InputDataConfig. Search for "S3DataSource" (ref) in the document. The first hit is even in Python, on page 25/26.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With