 

List S3 files in PySpark

I am new to PySpark and am trying to use the spark.read method to read S3 files into a DataFrame. I was able to successfully read one file from S3. Now I need to iterate over and read all the files in a bucket.

My question is: how do I iterate over the bucket and read all the files one by one?

I used to do this in Python using boto3 (s3_client.list_objects); is there something similar in PySpark?
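
For reference, this is roughly what I do today with boto3 (the bucket and prefix names are placeholders):

import boto3

s3_client = boto3.client('s3')
# List every object key under the prefix; bucket/prefix are placeholders
response = s3_client.list_objects(Bucket='your-bucket', Prefix='my-directory/')
for obj in response.get('Contents', []):
    print(obj['Key'])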

asked Sep 15 '25 by PythonDeveloper

1 Answer

What if you use the SparkSession and SparkContext to read all the files at once with the wholeTextFiles method, and then loop over the results? You can use the s3a connector in the URL, which lets Spark read from S3 through Hadoop.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('S3Example').getOrCreate()

s3_bucket = 'your-bucket'
s3_path = f's3a://{s3_bucket}/my-directory/'

# List the file paths under the S3 prefix
file_list = spark.sparkContext.wholeTextFiles(s3_path).map(lambda x: x[0]).collect()

for file_path in file_list:
    print(file_path)
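
Note that this assumes Spark can already authenticate to S3 and that the Hadoop AWS / S3A libraries are on the classpath. If credentials aren't picked up automatically (for example from an instance profile or environment variables), one option is to pass them through Hadoop's standard fs.s3a.* settings when building the session above; a minimal sketch with placeholder values:

spark = (SparkSession.builder
    .appName('S3Example')
    # spark.hadoop.* settings are forwarded to Hadoop, so these map to the
    # standard fs.s3a.* keys; the values here are placeholders.
    .config('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')
    .config('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')
    .getOrCreate())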

Please note that above I've only retrieved the file paths (via x[0] in the lambda). If you want the file contents as well, drop that map and keep the whole (path, content) pairs:

file_tuple = spark.sparkContext.wholeTextFiles(s3_path)
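
For example, to work with both values (just a sketch; collect() brings everything to the driver, so this is only sensible for small files):

for file_path, content in file_tuple.collect():
    # file_path is the full s3a:// URL; content is the file's text
    print(file_path, len(content))
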
answered Sep 19 '25 by Kulasangar