I am new to the AWS environment and trying to figure out how the data flow works. After successfully loading CSV files from S3 into a SageMaker notebook instance, I am stuck on doing the reverse.
I have a dataframe and want to upload that to S3 Bucket as CSV or JSON. The code that I have is below:
bucket='bucketname'
data_key = 'test.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df.to_csv(data_location)
I assumed that since I successfully used pd.read_csv() while loading, using df.to_csv() would also work, but it didn't. It is probably failing because this way I cannot pick the privacy options that I would when uploading a file manually to S3. Is there a way to upload the data to S3 from SageMaker?
One way to solve this would be to save the CSV to the local storage on the SageMaker notebook instance, and then use the S3 APIs via boto3 to upload the file as an S3 object. The S3 docs for upload_file() are available here.
Note, you'll need to ensure that your SageMaker hosted notebook instance has the proper read/write permissions in its IAM role, otherwise you'll receive a permissions error.
# code you already have, saving the file locally to whatever directory you wish
file_name = "mydata.csv"
df.to_csv(file_name)
# instantiate S3 client and upload to s3
import boto3
s3 = boto3.resource('s3')
s3.meta.client.upload_file(file_name, 'YOUR_S3_BUCKET_NAME', 'DESIRED_S3_OBJECT_NAME')
Alternatively, upload_fileobj() may help with parallelizing the transfer as a multipart upload.
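A minimal sketch of that alternative, assuming df is already in memory and using placeholder bucket/object names, streams the CSV through an in-memory buffer instead of writing a local file first:
import io
import boto3

# serialize the dataframe to CSV in memory instead of writing a local file
csv_buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))

# upload_fileobj() accepts any file-like object and uses managed (multipart) transfer for large payloads
s3_client = boto3.client('s3')
s3_client.upload_fileobj(csv_buffer, 'YOUR_S3_BUCKET_NAME', 'DESIRED_S3_OBJECT_NAME')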
You can use boto3 to upload a file but, given that you're working with a dataframe and pandas, you should consider dask. You can install it via conda install dask s3fs.
import dask.dataframe as dd
df = dd.read_csv('s3://{}/{}'.format(bucket, data2read),
storage_options={'key': AWS_ACCESS_KEY_ID,
'secret': AWS_SECRET_ACCESS_KEY})
Now, if you want to use this file as a pandas dataframe, you should compute it as
df = df.compute()
To write back to S3, you should first load your df into dask with the number of partitions you need (it must be specified):
df = dd.from_pandas(df, npartitions=N)
And then you can upload it to S3:
df.to_csv('s3://{}/{}'.format(bucket, data2write),
storage_options={'key': AWS_ACCESS_KEY_ID,
'secret': AWS_SECRET_ACCESS_KEY})
Although the API is similar, to_csv in pandas is not the same as the one in dask; in particular, the latter has the storage_options parameter.
Furthermore, dask doesn't save to a single file. Let me explain: if you decide to write to s3://my_bucket/test.csv with dask, then instead of having a file called test.csv you are going to have a folder with that name containing N files, where N is the number of partitions we specified before.
I understand that it could feel strange to save to multiple files, but given that dask reads all the files in a folder, once you get used to it, it can be very convenient.
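As a small sketch of that convenience (assuming the same bucket, key and credentials as above; the trailing glob is an assumption about how the partition files are laid out inside the folder), you can read the whole folder back in one call:
import dask.dataframe as dd

# read every partition file inside the output folder back into one dask dataframe
df_back = dd.read_csv('s3://{}/{}/*'.format(bucket, data2write),
                      storage_options={'key': AWS_ACCESS_KEY_ID,
                                       'secret': AWS_SECRET_ACCESS_KEY})
df_back = df_back.compute()  # materialize as a pandas dataframe if needed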