I'm having a hard time figuring out why Spark cannot access a file that I added to the context. Below is my code in the REPL:
scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")
scala> val featureFile = sc.textFile(SparkFiles.get("feature_matrix.json"))
featureFile: org.apache.spark.rdd.RDD[String] = /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json MappedRDD[1] at textFile at <console>:60
scala> featureFile.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: cfs://172.30.26.95/tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
The file does in fact exist at /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
Any help appreciated.
If you add a file with addFile, you need to retrieve it with SparkFiles.get. Also, the addFile method is lazy, so it is quite possible that the file was not placed at the location where you found it until you actually called first, which makes this circular.
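One more thing worth noticing: the exception reports the path as cfs://172.30.26.95/tmp/spark/..., which suggests that textFile resolved the local path returned by SparkFiles.get against the cluster's default filesystem rather than the local disk. Here is a minimal sketch of one way around that, assuming `sc` is the shell's SparkContext: resolve the path inside a task and read the file with plain local I/O instead of textFile. The single-element RDD is only a hypothetical vehicle for running code on an executor.

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

// Ship the file to the nodes (path taken from the question).
sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")

// Resolve the path inside a task, where the node-local copy lives,
// and read it with local file I/O rather than sc.textFile.
val firstLine = sc.parallelize(Seq(0), 1).mapPartitions { _ =>
  val src = Source.fromFile(SparkFiles.get("feature_matrix.json"))
  // Materialize the line before closing the file handle.
  try src.getLines().take(1).toList.iterator
  finally src.close()
}.collect()
```

In local mode, prefixing the path with file:// in the textFile call may be enough, but on a cluster the driver-side path is not guaranteed to exist on the executors, so I would not rely on it.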
All that being said, I doubt that using SparkFiles in the first action is ever a good idea. Use something like --files with spark-submit, and the files will be placed in your working directory.
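A rough sketch of what that could look like as a standalone application. The object name, jar name, and launch command are placeholders I made up, and opening the file by its bare name assumes it really does land in the executor's working directory as described above. If it does not on your setup, SparkFiles.get("feature_matrix.json") inside the task is the alternative to try.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

// Hypothetical launch command (class and jar names are placeholders):
//   spark-submit --class FeatureMatrixDemo \
//     --files /home/ubuntu/my_demo/src/main/resources/feature_matrix.json \
//     my_demo.jar
object FeatureMatrixDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FeatureMatrixDemo"))
    try {
      // Run one task that opens the shipped file by its bare name,
      // i.e. relative to the executor's working directory (assumption).
      val lineCount = sc.parallelize(Seq(0), 1).map { _ =>
        val src = Source.fromFile("feature_matrix.json")
        try src.getLines().size
        finally src.close()
      }.first()
      println(s"feature_matrix.json has $lineCount lines")
    } finally {
      sc.stop()
    }
  }
}
```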