As per the title. I'm aware of textFile, but, as the name suggests, it works only on text files. I would need to list the files/directories inside a given path, either on HDFS or a local path. I'm using pyspark.
Use the hdfs dfs -ls command to list files in HDFS, passing the directory whose contents you want to see.
If you type hdfs dfs -ls / you will get a list of the directories at the root of HDFS.
You can also use the hadoop fs -ls command to list the files in the current directory along with their details. The fifth column of the output contains the file size in bytes.
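If you need that listing from inside a pyspark script rather than a shell, one option is to shell out to the same command and parse its output. A minimal sketch, assuming the hdfs client is on the PATH of the driver machine; the "/some_dir" path and the hdfs_ls helper name are placeholders, and it does not handle file names containing spaces:

    import subprocess

    def hdfs_ls(path):
        """Return (size_in_bytes, path) tuples for the entries under `path`."""
        out = subprocess.run(["hdfs", "dfs", "-ls", path],
                             capture_output=True, text=True, check=True).stdout
        entries = []
        for line in out.splitlines():
            fields = line.split()
            # Skip the "Found N items" header; data rows have 8 columns:
            # permissions, replication, owner, group, size, date, time, name
            if len(fields) < 8:
                continue
            entries.append((int(fields[4]), fields[7]))
        return entries

    for size, name in hdfs_ls("/some_dir"):
        print(size, name)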
Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:
    # Grab the Hadoop filesystem classes through the SparkContext's JVM gateway.
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    # Connect to the namenode and list the contents of the directory.
    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
    for fileStatus in status:
        print(fileStatus.getPath())
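A variant of the same idea, sketched under the assumption that you have an active SparkContext named sc: instead of hard-coding the namenode URI, reuse the Hadoop configuration Spark already carries, so the same code resolves whichever default filesystem (HDFS or local) the context is configured with. The '/some_dir/' path is a placeholder:

    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

    # FileSystem.get with Spark's own Hadoop configuration resolves the
    # default filesystem (fs.defaultFS), so no host/port is hard-coded.
    fs = FileSystem.get(sc._jsc.hadoopConfiguration())
    for fileStatus in fs.listStatus(Path('/some_dir/')):
        # Each FileStatus exposes the path, size in bytes, and a directory flag.
        print(fileStatus.getPath(), fileStatus.getLen(), fileStatus.isDirectory())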