I am working in Python in a Jupyter notebook, and I am trying to read Parquet files from an AWS S3 bucket and combine them into a single pandas DataFrame.
The bucket and folders are arranged like this:
Bucket name: mybucket
First folder: 123
Second folder: Parquets.parquet
    file1.snappy.parquet
    file2.snappy.parquet
    ....
I am getting the full path with:
    import boto3

    s3 = boto3.resource('s3')  # assumed: s3 is a boto3 resource
    bucket = s3.Bucket(name='mybucket')
    keys = []
    for key in bucket.objects.all():
        keys.append("s3://mybucket/" + key.key)
And then reading them with:
    import pandas as pd

    df_list = []
    count = 0
    keys = keys[2:]
    for obj in bucket.objects.all():
        subsrc = obj.Object()
        key = obj.key
        path = keys[count]
        obj_df = pd.read_parquet(path)
        df_list.append(obj_df)
        count += 1
    df = pd.concat(df_list)
But that is giving me:
PermissionError: Forbidden
pointing to the line 'obj_df = pd.read_parquet(path)'. I know I have full S3 access, so that should not be the issue. Thank you so much!
This is probably because the path to the data is incorrect.
(In the code above, you're calling pd.read_parquet(path) where path = keys[count], but I'm pretty sure those are only the keys, which do not include the bucket name.)
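One way to rule out a bad path is to build the full s3:// URIs in a single place and keep only the keys that are actual Parquet files. The following is a minimal sketch, not a confirmed fix for the Forbidden error: the helper names build_parquet_paths and read_bucket_parquets are hypothetical, the prefix argument is an assumption, and reading s3:// paths with pandas assumes that s3fs is installed and can see your AWS credentials.

```python
def build_parquet_paths(bucket_name, keys):
    """Return full s3:// URIs for the keys that end in .parquet.

    Folder placeholder keys such as '123/' are skipped, so every
    returned path points at a file rather than at a folder.
    """
    return [
        f"s3://{bucket_name}/{key}"
        for key in keys
        if key.endswith(".parquet")
    ]


def read_bucket_parquets(bucket_name, prefix=""):
    """Read every .parquet object under prefix into one DataFrame.

    Hypothetical helper: needs boto3, pandas, and s3fs, plus valid
    AWS credentials in the environment.
    """
    import boto3
    import pandas as pd

    bucket = boto3.resource("s3").Bucket(bucket_name)
    keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
    paths = build_parquet_paths(bucket_name, keys)
    if not paths:
        raise ValueError("no .parquet objects found under the given prefix")
    frames = [pd.read_parquet(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

With the layout from the question, a call would look something like df = read_bucket_parquets("mybucket", prefix="123/Parquets.parquet/"). Printing the result of build_parquet_paths first is a cheap way to check that each path includes the bucket name and ends in a real file.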