 

Spark job in Java: how to access files from 'resources' when run on a cluster

I wrote a Spark job in Java. The job is packaged as a shaded jar and executed:

spark-submit my-jar.jar

In the code, there are some files (Freemarker templates) that reside in src/main/resources/templates. When run locally, I'm able to access the files:

File[] files = new File("src/main/resources/templates/").listFiles();

When the job is run on a cluster, a NullPointerException is thrown when the previous line is executed.

If I run jar tf my-jar.jar I can see that the files are packaged in a templates/ folder:

 [...]
 templates/
 templates/my_template.ftl
 [...]

I'm just unable to read them; I suspect that .listFiles() tries to access the local filesystem on the cluster node, where the files don't exist, so it returns null.
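In other words (a quick illustrative sketch, not code from my actual job), the entries are classpath resources rather than files, so the classloader can see them while a relative filesystem path cannot:

// Illustrative: inside the shaded jar the template is a classpath resource
// (MyJob is a placeholder for any class packaged in the jar)
java.net.URL url = MyJob.class.getClassLoader().getResource("templates/my_template.ftl"); // found on the cluster
File[] files = new File("src/main/resources/templates/").listFiles(); // null on the cluster: no such directory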

I'm curious to know how I should package files to be used within a self-contained Spark job. I'd rather not copy them to HDFS outside of the job because it becomes messy to maintain.

asked Apr 17 '16 by Alex Woolford


People also ask

How do I access Spark files?

To access a file in Spark jobs, use SparkFiles.get() with the filename to find its download location. A directory can be given if the recursive option is set to True. Currently, directories are only supported for Hadoop-supported filesystems.
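For example (an illustrative sketch; the filename is borrowed from the question above), a file shipped at submit time can be resolved on whichever node the code runs:

// Ship the file with the job:
//   spark-submit --files templates/my_template.ftl my-jar.jar
// Then resolve its local download location on the current node:
String path = org.apache.spark.SparkFiles.get("my_template.ftl"); // absolute local path
File template = new File(path);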

When a Spark job is submitted, what happens on the cluster?

With Spark, the underlying execution works like this: one driver program works with the cluster manager to schedule tasks on the worker nodes. Once those tasks complete, they return their results to the driver program.

Where does the Spark driver run in cluster mode?

Cluster mode: the Spark driver runs in the application master, which is the first container started when the Spark job runs.
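For example, on YARN that mode is chosen at submit time:

spark-submit --master yarn --deploy-mode cluster my-jar.jar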

How does a Spark program physically execute on a cluster?

A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. When the driver runs, it converts this logical graph into a physical execution plan. For example, collect is an action that gathers all the data and returns a final result.
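A minimal Java sketch of that behavior (sc is assumed to be an existing JavaSparkContext):

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> doubled = nums.map(x -> x * 2); // transformation: only extends the DAG
List<Integer> result = doubled.collect();        // action: the DAG executes and results return to the driver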


1 Answer

I accessed my resource file as shown below in Spark/Scala. Here is my code:

// Load the resource from the classpath (this works inside the jar on the cluster)
val fs = this.getClass.getClassLoader.getResourceAsStream("smoke_test/loadhadoop.txt")

// Read the whole stream into a String
val dataString = scala.io.Source.fromInputStream(fs).mkString
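Since the question is in Java, the same classpath approach there would look roughly like this (a sketch; MyJob is a placeholder class name, and the templates/ path matches the jar listing in the question):

// Read a packaged template from the classpath instead of the filesystem
InputStream in = MyJob.class.getClassLoader().getResourceAsStream("templates/my_template.ftl");
String content = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
        .lines()
        .collect(Collectors.joining("\n"));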
answered Nov 15 '22 by Anand