I added a file to the SparkContext using SparkContext.addFile("hdfs://host:54310/spark/fileName.txt") and verified its presence with org.apache.spark.SparkFiles.get("fileName.txt"). It returned an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.
Now I want to read the file from that absolute path through the SparkContext. I tried sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println), but it treats the path returned by SparkFiles.get() as an HDFS path, which is incorrect.
I searched extensively for helpful reads on this, but had no luck.
Is there anything wrong with the approach? Any help is really appreciated.
Here is the code and the outcome:
scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")
scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
... 49 elided
Spark provides the facility to upload files using sc.addFile; the local path where an uploaded file lands can then be resolved with SparkFiles.get.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM.
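For context, a minimal sketch of creating a SparkContext in Scala (the app name and master URL below are placeholders; "local[*]" runs Spark locally on all cores):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration for a local run.
val conf = new SparkConf().setAppName("spark-files-demo").setMaster("local[*]")
val sc = new SparkContext(conf)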
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
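For illustration, a few sketches of the URI schemes sc.textFile accepts (all paths below are hypothetical, and the S3 example assumes the S3A connector is on the classpath):

sc.textFile("hdfs://localhost:54310/data/input.txt")  // HDFS
sc.textFile("file:///tmp/local-input.txt")            // local filesystem
sc.textFile("s3a://some-bucket/input.txt")            // Amazon S3 via S3A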
To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location. A directory can be given if the recursive option is set to true; currently, directories are only supported for Hadoop filesystems.
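A minimal sketch of uploading a directory in Scala (the directory name is hypothetical; the Scala API has the overload sc.addFile(path, recursive) taking a Boolean):

// Upload an entire HDFS directory to every node; recursive = true is
// required for directories and only works on Hadoop-supported filesystems.
sc.addFile("hdfs://localhost:54310/spark/configDir", true)
val localDir = org.apache.spark.SparkFiles.get("configDir") // local copy's path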
Refer to the local file using the "file://" scheme:
sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)
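A note on why this works: with no scheme, sc.textFile resolves the path against Hadoop's fs.defaultFS (hdfs://localhost:54310 in your setup), which is exactly the HDFS path shown in the exception. The "file://" prefix forces the local filesystem instead. Keep in mind this reads the driver's local copy; on a multi-node cluster each executor downloads its own copy under its own temp directory, so this pattern is mainly reliable in local mode.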