I am working in Python in a Jupyter notebook, and I am trying to read Parquet files from an AWS S3 bucket and combine them into a single pandas DataFrame.
The bucket and folders are arranged like this:
The bucket name: mybucket
   First Folder: 123
      Second Folder: Parquets.parquet
        file1.snappy.parquet
        file2.snappy.parquet
        ....
I am getting the full path with:
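import boto3

s3 = boto3.resource('s3')  # assumed: the original snippet does not show how s3 was created
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append("s3://mybucket/" + key.key)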
And then reading them with:
import pandas as pd

df_list = []
count = 0
keys = keys[2:]
for obj in bucket.objects.all():
    subsrc = obj.Object()
    key = obj.key
    path = keys[count]
    obj_df = pd.read_parquet(path)
    df_list.append(obj_df)
    count += 1

df = pd.concat(df_list)
But that is giving me:
PermissionError: Forbidden 
pointing to the line 'obj_df = pd.read_parquet(path)'. I know I have full S3 access, so that should not be the issue. Thank you so much!
This is probably because the path to the data is incorrect.
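(In the code above, you're doing pd.read_parquet(path) where path = keys[count]. Check that each path is a full s3://bucket/key URL that points at an actual Parquet file: obj.key on its own does not include the bucket name, and after keys = keys[2:] the entries in keys no longer line up with the objects in the loop.)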
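One way to rule out a bad path is to build the paths from the listing itself and keep only keys that end in ".parquet", so folder placeholder keys and other non-data objects are never passed to pd.read_parquet. The following is only a sketch under assumptions, not a confirmed fix for the Forbidden error: the helper names parquet_paths and read_parquet_prefix are made up for illustration, and it assumes boto3, pandas, and s3fs are installed and that your credentials are visible to them.

```python
def parquet_paths(bucket_name, keys):
    """Build s3:// URLs for the keys that look like Parquet files.

    Hypothetical helper: keys ending in "/" (folder placeholders) and
    other non-Parquet objects are skipped.
    """
    return [
        f"s3://{bucket_name}/{key}"
        for key in keys
        if key.endswith(".parquet")
    ]


def read_parquet_prefix(bucket_name, prefix):
    """Read every Parquet file under a prefix into one DataFrame.

    Hypothetical helper; needs boto3, pandas, and s3fs at call time.
    """
    # Third-party imports are local so parquet_paths() works without them.
    import boto3
    import pandas as pd

    bucket = boto3.resource("s3").Bucket(bucket_name)
    # List only the objects under the given prefix, not the whole bucket.
    keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
    paths = parquet_paths(bucket_name, keys)
    # pd.concat raises ValueError if no Parquet files were found.
    frames = [pd.read_parquet(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

With the layout from the question, the call would look like df = read_parquet_prefix("mybucket", "123/Parquets.parquet/"). If the error persists with paths built this way, the path is probably not the cause, and it is worth checking which credentials the notebook is actually using.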