I am trying to get a list of parquet file paths from S3 that are inside subdirectories, subdirectories of subdirectories, and so on.
If it was my local file system I would do this:
import glob
glob.glob('C:/Users/user/info/**/*.parquet', recursive=True)
I have tried using the glob method of s3fs; however, it doesn't have a recursive kwarg.
Is there a function I can use, or do I need to implement it myself?
You can use s3fs with glob. In current s3fs/fsspec versions, `**` matches across nested prefixes, so the pattern mirrors your local glob example:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
s3.glob('your/s3/path/here/**/*.parquet')
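For example, with a hypothetical layout like my-bucket/info/... (bucket and prefix names below are placeholders), the call might look like this. Note that glob returns bare keys without the s3:// scheme, so prepend it if a downstream reader expects full S3 URLs:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# '**' descends into nested "directories" (key prefixes)
paths = s3.glob('my-bucket/info/**/*.parquet')

# glob returns bare keys such as 'my-bucket/info/a/b/file.parquet';
# prepend the scheme if a reader expects full S3 URLs
urls = [f's3://{p}' for p in paths]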
I also wanted to download the latest file from an S3 bucket, but located in a specific folder. Initially I tried using glob but couldn't find a solution, so I built the following function. You can modify it to work with subfolders.
This function returns a dictionary of all file names and timestamps as key-value pairs
(key: file_name, value: timestamp).
Just pass the bucket name and prefix (which is the folder name).
import boto3

def get_file_names(bucket_name, prefix):
    """
    Return a dict of file names and timestamps in an S3 bucket folder.
    :param bucket_name: Name of the S3 bucket.
    :param prefix: Only fetch keys that start with this prefix (folder name).
    """
    s3_client = boto3.client('s3')
    objs = s3_client.list_objects_v2(Bucket=bucket_name)['Contents']
    shortlisted_files = dict()
    for obj in objs:
        key = obj['Key']
        timestamp = obj['LastModified']
        # if the key starts with the folder name, keep it
        if key.startswith(prefix):
            # add a new key-value pair
            shortlisted_files.update({key: timestamp})
    return shortlisted_files

files = get_file_names(bucket_name='use_your_bucket_name', prefix='folder_name/')
# pick the key with the most recent LastModified timestamp
latest_filename = max(files, key=files.get)
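Note that a single list_objects_v2 call returns at most 1,000 objects. If your bucket holds more than that, you can wrap the same idea in boto3's paginator. Here is a minimal sketch under that assumption; list_parquet_keys is a hypothetical helper name and the bucket/prefix values are placeholders:
import boto3

def list_parquet_keys(bucket_name, prefix):
    """List every .parquet key under a prefix, paging past the
    1,000-object limit of a single list_objects_v2 response."""
    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent when a page has no matching objects
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.parquet'):
                keys.append(obj['Key'])
    return keys

parquet_keys = list_parquet_keys('use_your_bucket_name', 'folder_name/')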