Is it possible to list all of the files in a given S3 path (e.g. s3://my-bucket/my-folder/*.extension) using a SparkSession object?
If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It is important to pick package versions that match your Spark and Hadoop build.
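For example, here is a minimal PySpark sketch of pulling those packages in when building the session (the coordinates and versions below are placeholders; they must match your Spark and Hadoop distribution):

from pyspark.sql import SparkSession

# Placeholder package versions; pick ones that match your Hadoop/Spark build.
spark = (
    SparkSession.builder
    .appName('list-s3-files')
    .config(
        'spark.jars.packages',
        'org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262',
    )
    .getOrCreate()
)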
Configuring the Spark shell: start the Spark shell (with the spark-csv package if you also need to read CSV files), set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables so Spark can communicate with S3, and restart the shell so the settings take effect.
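As a hedged sketch, assuming those environment variables are already exported and you are going through the s3a connector, you can also copy them into the Hadoop configuration from PySpark:

import os

# Hand the AWS credentials from the environment to the Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', os.environ['AWS_ACCESS_KEY_ID'])
hadoop_conf.set('fs.s3a.secret.key', os.environ['AWS_SECRET_ACCESS_KEY'])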
You can use the Hadoop API to access files on S3 (Spark uses it as well):
import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
val path = "s3://somebucket/somefolder"
val fileSystem = FileSystem.get(URI.create(path), new Configuration())
val it = fileSystem.listFiles(new Path(path), true)
while (it.hasNext()) {
  // each element is a LocatedFileStatus; getPath returns the file's full path
  println(it.next().getPath)
}
Approach 1
For pyspark users, I've translated Michael Spector's answer (I'll leave it to you to decide if using this is a good idea):
sc = spark.sparkContext
myPath = 's3://my-bucket/my-prefix/'
javaPath = sc._jvm.java.net.URI.create(myPath)
hadoopPath = sc._jvm.org.apache.hadoop.fs.Path(myPath)
hadoopFileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem.get(javaPath, sc._jvm.org.apache.hadoop.conf.Configuration())
iterator = hadoopFileSystem.listFiles(hadoopPath, True)
s3_keys = []
while iterator.hasNext():
    s3_keys.append(iterator.next().getPath().toUri().getRawPath())
s3_keys now holds all file keys found at my-bucket/my-prefix.
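Since the question asks for a particular extension, you can filter the collected keys afterwards (a small sketch; '.extension' is just a placeholder suffix):

# Keep only the keys that end with the extension you care about.
matching_keys = [key for key in s3_keys if key.endswith('.extension')]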
Approach 2
Here is an alternative that I found (hat tip to @forgetso):
myPath = 's3://my-bucket/my-prefix/*'
hadoopPath = sc._jvm.org.apache.hadoop.fs.Path(myPath)
hadoopFs = hadoopPath.getFileSystem(sc._jvm.org.apache.hadoop.conf.Configuration())
statuses = hadoopFs.globStatus(hadoopPath)
for status in statuses:
    print(status.getPath().toUri().getRawPath())
    # Alternatively, you can get file names only with:
    # status.getPath().getName()
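If all you want is the matched paths as a Python list, you can collect them in one pass (a small sketch building on statuses above):

# Collect the fully qualified URIs matched by the glob.
matched_paths = [status.getPath().toString() for status in statuses]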
Approach 3 (incomplete!)
The two approaches above do not use the Spark parallelism mechanism that would be applied on a distributed read. That logic looks private, though; see parallelListLeafFiles here. I have not found a way to compel pyspark to do a distributed ls on S3 without also reading the file contents. I tried to use py4j to instantiate an InMemoryFileIndex, but I can't get the incantation right. Here is what I have so far, in case someone wants to pick it up from here:
myPath = 's3://my-bucket/my-path/'
paths = sc._gateway.new_array(sc._jvm.org.apache.hadoop.fs.Path, 1)
paths[0] = sc._jvm.org.apache.hadoop.fs.Path(myPath)
emptyHashMap = sc._jvm.java.util.HashMap()
emptyScalaMap = sc._jvm.scala.collection.JavaConversions.mapAsScalaMap(emptyHashMap)
# Py4J is not happy with this:
sc._jvm.org.apache.spark.sql.execution.datasources.InMemoryFileIndex(
    spark._jsparkSession,
    paths,
    emptyScalaMap,
    sc._jvm.scala.Option.empty()  # an empty scala Option (i.e. None)
)