I am new to PySpark and trying to use the spark.read method to read S3 files into a DataFrame. I was able to successfully read one file from S3. Now I need to iterate over and read all the files in a bucket.
My question is: how do I iterate over and get all the files one by one?
I used to do this in Python using boto3 (s3_client.list_objects); is there something similar in PySpark?
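For reference, this is roughly what my boto3 listing code looks like (bucket name and prefix are placeholders):
import boto3

s3_client = boto3.client('s3')
# List object keys under a prefix; for more than 1000 objects you would need to paginate
response = s3_client.list_objects(Bucket='your-bucket', Prefix='my-directory/')
for obj in response.get('Contents', []):
    print(obj['Key'])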
What if you use the SparkSession and SparkContext to read the files at once and then loop through the S3 directory using the wholeTextFiles method? You can use the s3a connector in the URL, which allows Spark to read from S3 through Hadoop.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('S3Example').getOrCreate()
s3_bucket = 'your-bucket'
s3_path = f's3a://{s3_bucket}/my-directory/'
# List all file paths in the S3 directory
# wholeTextFiles returns an RDD of (path, content) pairs; keep only the paths
file_list = spark.sparkContext.wholeTextFiles(s3_path).map(lambda x: x[0]).collect()

for file_path in file_list:
    print(file_path)
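From there you can feed each path back into spark.read, which was your original goal. A minimal sketch, assuming the files are CSVs (swap in spark.read.json, spark.read.parquet, etc. for other formats):
for file_path in file_list:
    # Read each file into its own DataFrame; adjust the options to match your data
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    df.show(5)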
Please note that in the listing step above I've only kept the file paths. If you want the file contents as well, drop the map (the x[0] in the lambda) and you get both the path and the content for each file:
file_tuple = spark.sparkContext.wholeTextFiles(s3_path)
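Each element of file_tuple is a (path, content) pair. As a small sketch, you can pull a few files back to the driver and inspect them (take is used here because collect on a large directory can exhaust driver memory):
# Print each sampled file's path plus the first 100 characters of its content
for file_path, content in file_tuple.take(5):
    print(file_path, content[:100])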