Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public.

import dask.dataframe as dd

df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'anon': True})

This is not recommended for obvious reasons. How do I load the data from S3 securely?

asked Dec 13 '22 by shantanuo

1 Answer

The backend that loads the data from S3 is s3fs, and its documentation has a section on credentials, which mostly points you to boto3's documentation.

The short answer is that there are a number of ways of providing S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be available to all workers), or the cluster's instance metadata service.
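For example, here is a minimal sketch of the environment-variable route. The values below are placeholders, not real credentials; in practice you would set these variables in the shell or worker environment before the process starts, and on a cluster they must be set on every worker.

import os

# assumption: placeholder values, substitute your real credentials
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'

import dask.dataframe as dd

# boto3/s3fs pick the credentials up automatically, so no
# storage_options are needed at all
df = dd.read_csv('s3://mybucket/some-big.csv')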

Alternatively, you can provide your key/secret directly in the call, but doing so means that you must trust your execution platform and the communication between workers:

df = dd.read_csv('s3://mybucket/some-big.csv',
                 storage_options={'key': mykey, 'secret': mysecret})
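If you are working with temporary credentials (for example from STS or an assumed role), s3fs also accepts a session token alongside the key and secret. A sketch, where mytoken is a hypothetical variable holding the session token string you obtained elsewhere:

# mytoken is assumed to hold an AWS session token (e.g. from STS)
df = dd.read_csv('s3://mybucket/some-big.csv',
                 storage_options={'key': mykey, 'secret': mysecret,
                                  'token': mytoken})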

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
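For instance, recent versions of s3fs let you point at a named profile in your AWS credentials file instead of passing the key pair around. A sketch, assuming a profile called 'my-profile' exists in ~/.aws/credentials on the client and on any workers:

# assumption: 'my-profile' is a named profile in ~/.aws/credentials
df = dd.read_csv('s3://mybucket/some-big.csv',
                 storage_options={'profile': 'my-profile'})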

General reference: http://docs.dask.org/en/latest/remote-data-services.html

answered Jan 02 '23 by mdurant