I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true).

And with Spark:

FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
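For example, here is a minimal Scala sketch of draining the RemoteIterator that listFiles returns; the /data/root path is just a placeholder for your own root directory, and sc is assumed to be an existing SparkContext:

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}

// Sketch: recursively list every file under a placeholder root directory.
// listFiles returns a RemoteIterator[LocatedFileStatus], so it is drained with a while loop.
val fs = FileSystem.get(sc.hadoopConfiguration)
val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(new Path("/data/root"), true) // true = recursive
while (it.hasNext) {
  val status = it.next()
  println(s"${status.getPath} (${status.getLen} bytes)")
}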
Edit
It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme:
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
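Since the question is about iterating over the sub-directories themselves, here is a small sketch (again with a placeholder path) that resolves the FileSystem from the path's scheme and keeps only the immediate children that are directories:

import org.apache.hadoop.fs.Path

// Sketch: listStatus plus an isDirectory filter yields the immediate sub-directories.
// "/data/root" is an illustrative placeholder.
val rootPath = new Path("/data/root")
val rootFs = rootPath.getFileSystem(sc.hadoopConfiguration)
val subDirs = rootFs.listStatus(rootPath).filter(_.isDirectory).map(_.getPath)
subDirs.foreach(println)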
Here's a PySpark version, if someone is interested:
# Access the Hadoop FileSystem API through the JVM gateway
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')

# listStatus returns a FileStatus for each entry directly under the path
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table.
Other methods of the FileStatus object, such as getLen() to get the file size, are described here:
Class FileStatus
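For instance, a quick Scala sketch (with a placeholder path) touching a few of those accessors:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of a few commonly used FileStatus accessors from the javadoc linked above.
val warehouseFs = FileSystem.get(sc.hadoopConfiguration)
for (s <- warehouseFs.listStatus(new Path("/data/root"))) // placeholder path
  println(s"${s.getPath} len=${s.getLen} dir=${s.isDirectory} modified=${s.getModificationTime}")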