
Error when reading a file in Spark

I'm having a hard time figuring out why Spark is not accessing a file that I add to the context with sc.addFile. Below is my code in the REPL:

scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")

scala> val featureFile = sc.textFile(SparkFiles.get("feature_matrix.json"))

featureFile: org.apache.spark.rdd.RDD[String] = /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json MappedRDD[1] at textFile at <console>:60

scala> featureFile.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: cfs://172.30.26.95/tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json

The file does in fact exist at /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json

Any help appreciated.

asked by worker1138

1 Answer

If you are using addFile, then you need to use SparkFiles.get to retrieve it. Also, the distribution done by addFile is effectively lazy, so it is quite possible the file has not been placed at that location until you actually call first(), which creates a bit of a chicken-and-egg situation.
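For what it's worth, here is a minimal sketch of that addFile / SparkFiles.get pairing in the REPL. SparkFiles.get returns a plain local path, so this sketch reads it with scala.io.Source instead of sc.textFile (which, judging from the cfs:// prefix in the stack trace above, resolves bare paths against the cluster's default filesystem rather than the local disk):

scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")

scala> val localPath = org.apache.spark.SparkFiles.get("feature_matrix.json")  // local filesystem path

scala> val firstLine = scala.io.Source.fromFile(localPath).getLines().next()   // read it as an ordinary local file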

All that being said, I don't know that using SparkFiles as the first action is ever going to be a smart idea. Use something like --files with spark-submit and the files will be put in your working directory.
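As a sketch of that --files route (the jar and main-class names below are placeholders): launching with something like

spark-submit --files /home/ubuntu/my_demo/src/main/resources/feature_matrix.json --class my.demo.Main my_demo.jar

ships the file alongside the job so that, per the answer above, it lands in the working directory and can be opened by its bare name inside the application:

val featureJson = scala.io.Source.fromFile("feature_matrix.json").getLines().mkString("\n")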

answered by Justin Pihony

