
Apache Spark: SparkFiles.get(fileName.txt) - Unable to retrieve the file contents from SparkContext


I added a file to the SparkContext using SparkContext.addFile("hdfs://host:54310/spark/fileName.txt") and verified its presence with org.apache.spark.SparkFiles.get("fileName.txt"), which returned an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.

Now I want to read the file from that absolute path through the SparkContext. I tried sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println), but it treats the path returned by SparkFiles.get() as an HDFS path, which is incorrect.

I searched extensively for anything helpful on this, but had no luck.

Is there anything wrong with this approach? Any help is really appreciated.

Here is the code and the outcome:

scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")

scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt

scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  ... 49 elided
asked Jun 30 '18 by Marco99


People also ask

What is SparkFiles in PySpark?

PySpark provides the facility to upload your files using sc.addFile. You can then retrieve the local path of an uploaded file using SparkFiles.get.

What is SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM.
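For illustration, here is a minimal Scala sketch of creating a SparkContext in a standalone application (in spark-shell, as in the question, the context is already available as sc); the app name and master URL below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// placeholder app name and master; adjust for your cluster
val conf = new SparkConf().setAppName("sparkfiles-example").setMaster("local[*]")
val sc = new SparkContext(conf)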

Can Spark read from local file system?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

How do I access Spark files?

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location. A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
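As a rough Scala sketch of the directory case (the HDFS directory path below is hypothetical): SparkContext.addFile accepts a recursive flag, and SparkFiles.get then returns the local copy of the added directory.

// hypothetical HDFS directory; recursive = true is required for directories
sc.addFile("hdfs://localhost:54310/spark/configs", true)

// local path of the downloaded directory on this node
val localDir = org.apache.spark.SparkFiles.get("configs")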


1 Answer

Refer to the local file using the file:// scheme. SparkFiles.get() returns a plain local path; without a scheme, sc.textFile() resolves it against the default filesystem (HDFS in this setup), which is why the path was looked up on HDFS and not found. Prefixing it with file:// forces Spark to read from the local filesystem:

sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)
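Putting it together in the same spark-shell session as the question (same assumptions about the HDFS host and file name), the full sequence would look roughly like this:

scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")

scala> val localPath = org.apache.spark.SparkFiles.get("fileName.txt")

scala> sc.textFile("file://" + localPath).collect().foreach(println)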
answered Sep 19 '22 by Sudev Ambadi