
Load S3 Data into AWS SageMaker Notebook

I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker python jupyter notebook for analysis.

I could use boto to grab the data from S3, but I'm wondering whether there is a more elegant method as part of the SageMaker framework to do this in my python code?

Thanks in advance for any advice.

asked Jan 15 '18 by A555h55

People also ask

Can SageMaker read from S3?

AWS provides the boto3 library, which gives easy access to the AWS ecosystem of tools and products. SageMaker is part of that ecosystem, so it allows easy access to S3. One of the key concepts in boto3 is a resource, an abstraction that provides access to the AWS API and AWS resources.
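
For example, a minimal sketch of the resource abstraction (the bucket and key names here are placeholders):

import boto3

# Build an S3 resource and point at one object
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'train.csv')
body = obj.get()['Body'].read()  # raw bytes of the object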

How to load data from AWS S3 into SageMaker Jupyter Notebook?

You can load data from AWS S3 into AWS SageMaker using Boto3 or AWS Data Wrangler (the awswrangler package). In this tutorial, you'll learn how to load data from AWS S3 into a SageMaker Jupyter notebook. Note: this only reads the data from S3; the files are not downloaded to the SageMaker instance itself.
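
A rough sketch with the awswrangler package (assuming it is installed; the bucket and key are placeholders):

import awswrangler as wr

# Read the CSV straight from S3 into a pandas DataFrame
df = wr.s3.read_csv('s3://my-bucket/train.csv')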

How to upload a CSV file to SageMaker notebook?

One way to solve this is to save the CSV to the local storage on the SageMaker notebook instance, and then use the S3 APIs via boto3 to upload the file as an S3 object. See the S3 docs for upload_file().
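
A minimal sketch of that upload with boto3 (the local file, bucket, and key names are placeholders):

import boto3

# Upload a local CSV to s3://my-bucket/uploads/local_data.csv
s3 = boto3.client('s3')
s3.upload_file('local_data.csv', 'my-bucket', 'uploads/local_data.csv')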

How to use SageMaker and boto3 with Amazon S3?

Using the SageMaker SDK and Boto3, upload the training and validation datasets to the default Amazon S3 bucket. The datasets in the S3 bucket can then be used by a compute-optimized SageMaker instance on Amazon EC2 for training.
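
One possible sketch with the SageMaker Python SDK's Session.upload_data (the file names and key prefix are assumptions):

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()  # default SageMaker bucket for this account/region

# Upload local files and get back their S3 URIs
train_uri = session.upload_data('train.csv', bucket=bucket, key_prefix='data')
val_uri = session.upload_data('validation.csv', bucket=bucket, key_prefix='data')
print(train_uri, val_uri)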

How do I import a dataset from Amazon S3?

You can import either a single file or multiple files as a dataset. You can use the multifile import operation when you have a dataset that is partitioned into separate files. It takes all of the files from an Amazon S3 directory and imports them as a single dataset.


3 Answers

import boto3
import pandas as pd
from sagemaker import get_execution_role

# IAM execution role attached to the notebook instance (it must grant S3 read access)
role = get_execution_role()

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# pandas can read the s3:// URI directly (this relies on the s3fs package being available)
pd.read_csv(data_location)
answered Oct 23 '22 by Chhoser


In the simplest case you don't need boto3 at all, because you are only reading data.
Then it's even simpler:

import pandas as pd

bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_csv(data_location)

But as Prateek stated, make sure your SageMaker notebook instance is configured with access to S3. This is done at the notebook creation step, under Permissions > IAM role.

answered Oct 23 '22 by ivankeller


If you have a look here, it seems you can specify this in the InputDataConfig. Search for "S3DataSource" in the document; the first hit is even in Python, on page 25/26.
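
A hedged sketch of what that S3DataSource block can look like inside an InputDataConfig for create_training_job (the bucket path and channel name are placeholders):

# One training channel whose data lives under an S3 prefix
input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/train/',
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'ContentType': 'text/csv'
    }
]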

answered Oct 23 '22 by Jonatan