I am new to the AWS environment and trying to figure out how the data flow works. After successfully loading CSV files from S3 into a SageMaker notebook instance, I am stuck on doing the reverse.
I have a dataframe and want to upload that to S3 Bucket as CSV or JSON. The code that I have is below:
bucket='bucketname'
data_key = 'test.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df.to_csv(data_location)
I assumed that since I successfully used pd.read_csv() while loading, using df.to_csv() would also work, but it didn't. It is probably failing because this way I cannot pick the privacy options that I would when uploading a file manually to S3. Is there a way to upload the data to S3 from SageMaker?
One way to solve this would be to save the CSV to the local storage on the SageMaker notebook instance, and then use the S3 APIs via boto3 to upload the file as an S3 object. The S3 docs for upload_file() are available here.
Note, you'll need to ensure that your SageMaker hosted notebook instance has the proper read/write permissions in its IAM role, otherwise you'll receive a permissions error.
# code you already have, saving the file locally to whatever directory you wish
file_name = "mydata.csv"
df.to_csv(file_name)
# instantiate S3 client and upload to s3
import boto3
s3 = boto3.resource('s3')
s3.meta.client.upload_file(file_name, 'YOUR_S3_BUCKET_NAME', 'DESIRED_S3_OBJECT_NAME')
Alternatively, upload_fileobj() may help with parallelizing the transfer as a multipart upload.
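A minimal sketch of that alternative, assuming df is already in memory and using placeholder bucket/object names, streams the CSV through an in-memory buffer instead of writing a local file first:
import io
import boto3

# serialize the dataframe to CSV in memory instead of writing a local file
csv_buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))

# upload_fileobj() accepts any file-like object and uses managed (multipart) transfer for large payloads
s3_client = boto3.client('s3')
s3_client.upload_fileobj(csv_buffer, 'YOUR_S3_BUCKET_NAME', 'DESIRED_S3_OBJECT_NAME')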
You can use boto3 to upload a file but, given that you're working with a dataframe and pandas, you should consider dask. You can install it via conda install dask s3fs.
import dask.dataframe as dd
df = dd.read_csv('s3://{}/{}'.format(bucket, data2read),
storage_options={'key': AWS_ACCESS_KEY_ID,
'secret': AWS_SECRET_ACCESS_KEY})
Now, if you want to use this file as a pandas dataframe, you should compute it as
df = df.compute()
To write back to S3, you should first load your df into dask with the number of partitions you need (it must be specified):
df = dd.from_pandas(df, npartitions=N)
And then you can upload it to S3:
df.to_csv('s3://{}/{}'.format(bucket, data2write),
storage_options={'key': AWS_ACCESS_KEY_ID,
'secret': AWS_SECRET_ACCESS_KEY})
Although the API is similar, to_csv in pandas is not the same as the one in dask; in particular, the latter has the storage_options parameter.
Furthermore, dask doesn't save to a single file. Let me explain: if you decide to write to s3://my_bucket/test.csv with dask, then instead of having a file called test.csv you are going to have a folder with that name containing N files, where N is the number of partitions we specified before.
I understand that it could feel strange to save to multiple files, but given that dask reads all the files in a folder, once you get used to it, it can be very convenient.
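As a small sketch of that convenience (assuming the same bucket, key and credentials as above; the trailing glob is an assumption about how the partition files are laid out inside the folder), you can read the whole folder back in one call:
import dask.dataframe as dd

# read every partition file inside the output folder back into one dask dataframe
df_back = dd.read_csv('s3://{}/{}/*'.format(bucket, data2write),
                      storage_options={'key': AWS_ACCESS_KEY_ID,
                                       'secret': AWS_SECRET_ACCESS_KEY})
df_back = df_back.compute()  # materialize as a pandas dataframe if needed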