I can load the data only if I change the `anon` parameter to `True` after making the file public:

df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'anon': True})
This is not recommended for obvious reasons. How do I load the data from S3 securely?
The backend that loads the data from S3 is s3fs. Its documentation has a section on credentials, which mostly points you to boto3's documentation.
The short answer is that there are a number of ways of providing S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster metadata service (e.g., an IAM role attached to an EC2 instance).
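For the environment-variable route, a minimal sketch (the values shown are placeholders, and each worker's environment needs the same variables):

```shell
# Standard boto3 environment variables -- boto3/s3fs picks these up
# automatically, so no credentials appear in your Dask code.
export AWS_ACCESS_KEY_ID=placeholder-key-id
export AWS_SECRET_ACCESS_KEY=placeholder-secret

# Alternatively, put the same values in the shared credentials file
# at ~/.aws/credentials under a [default] profile.
```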
Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})
The full set of parameters you can pass in storage_options when using s3fs can be found in the s3fs API docs.
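One way to combine these approaches is a small helper that builds storage_options from the environment and falls back to anonymous access. This is a hypothetical sketch (the helper name is illustrative, not part of Dask or s3fs); it assumes the standard boto3 variable names:

```python
import os

def s3_storage_options():
    """Build a storage_options dict for dd.read_csv.

    Uses AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY when both are set;
    otherwise falls back to anonymous (public-bucket) access.
    Illustrative helper, not part of any library.
    """
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    return {"anon": True}
```

You would then call `dd.read_csv('s3://mybucket/some-big.csv', storage_options=s3_storage_options())`; the credentials themselves never appear in the script.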
General reference: http://docs.dask.org/en/latest/remote-data-services.html