 

List S3 files in PySpark

I am new to PySpark and am trying to use the spark.read method to read S3 files into a DataFrame. I was able to successfully read one file from S3. Now I need to iterate over and read all the files in a bucket.

My question is: how do I iterate over the bucket and read all the files one by one?

I used to do this in Python using boto3 (s3_client.list_objects); is there something similar in PySpark?
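
For reference, this is roughly what I do today with boto3 (the bucket and prefix names are placeholders):

import boto3

s3_client = boto3.client('s3')
# List every object key under the prefix; bucket/prefix are placeholders
response = s3_client.list_objects(Bucket='your-bucket', Prefix='my-directory/')
for obj in response.get('Contents', []):
    print(obj['Key'])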

asked Sep 15 '25 by PythonDeveloper

1 Answer

What if you use the SparkSession and SparkContext to read all the files at once with the wholeTextFiles method, and then loop over the results? You can use the s3a connector in the URL, which lets Spark read from S3 through Hadoop.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('S3Example').getOrCreate()

s3_bucket = 'your-bucket'
s3_path = f's3a://{s3_bucket}/my-directory/'

# List the file paths under the S3 prefix
file_list = spark.sparkContext.wholeTextFiles(s3_path).map(lambda x: x[0]).collect()

for file_path in file_list:
    print(file_path)
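
Note that this assumes Spark can already authenticate to S3 and that the Hadoop AWS / S3A libraries are on the classpath. If credentials aren't picked up automatically (for example from an instance profile or environment variables), one option is to pass them through Hadoop's standard fs.s3a.* settings when building the session above; a minimal sketch with placeholder values:

spark = (SparkSession.builder
    .appName('S3Example')
    # spark.hadoop.* settings are forwarded to Hadoop, so these map to the
    # standard fs.s3a.* keys; the values here are placeholders.
    .config('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')
    .config('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')
    .getOrCreate())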

Please note that above I've only retrieved the file paths (via x[0] in the lambda). If you want the file contents as well, drop that map and keep the whole (path, content) pairs:

file_tuple = spark.sparkContext.wholeTextFiles(s3_path)
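
For example, to work with both values (just a sketch; collect() brings everything to the driver, so this is only sensible for small files):

for file_path, content in file_tuple.collect():
    # file_path is the full s3a:// URL; content is the file's text
    print(file_path, len(content))
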
answered Sep 19 '25 by Kulasangar