
Apache Spark: SparkFiles.get(fileName.txt) - Unable to retrieve the file contents from SparkContext


I added a file to the SparkContext using SparkContext.addFile("hdfs://host:54310/spark/fileName.txt") and verified its presence with org.apache.spark.SparkFiles.get("fileName.txt"), which returned an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.

Now I want to read the file from that absolute path through the SparkContext. I tried sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println), but it treats the path returned by SparkFiles.get() as an HDFS path, which is incorrect.

I searched extensively for anything helpful on this, but had no luck.

Is there anything wrong with this approach? Any help is really appreciated.

Here is the code and the outcome:

scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")

scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt

scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  ... 49 elided
asked Jun 30 '18 by Marco99


People also ask

What is SparkFiles in PySpark?

PySpark provides the facility to upload your files using sc.addFile. You can then retrieve the local path of an uploaded file using SparkFiles.get.

What is SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM.
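For illustration, here is a minimal Scala sketch of creating a SparkContext in a standalone application (in spark-shell, as in the question, the context is already available as sc); the app name and master URL below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// placeholder app name and master; adjust for your cluster
val conf = new SparkConf().setAppName("sparkfiles-example").setMaster("local[*]")
val sc = new SparkContext(conf)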

Can Spark read from local file system?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

How do I access Spark files?

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location. A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
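As a rough Scala sketch of the directory case (the HDFS directory path below is hypothetical): SparkContext.addFile accepts a recursive flag, and SparkFiles.get then returns the local copy of the added directory.

// hypothetical HDFS directory; recursive = true is required for directories
sc.addFile("hdfs://localhost:54310/spark/configs", true)

// local path of the downloaded directory on this node
val localDir = org.apache.spark.SparkFiles.get("configs")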


1 Answer

Refer to the local file using the file:// scheme. SparkFiles.get() returns a plain local path; without a scheme, sc.textFile() resolves it against the default filesystem (HDFS in this setup), which is why the path was looked up on HDFS and not found. Prefixing it with file:// forces Spark to read from the local filesystem:

sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)
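Putting it together in the same spark-shell session as the question (same assumptions about the HDFS host and file name), the full sequence would look roughly like this:

scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")

scala> val localPath = org.apache.spark.SparkFiles.get("fileName.txt")

scala> sc.textFile("file://" + localPath).collect().foreach(println)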
answered Sep 19 '22 by Sudev Ambadi