 

Where does Spark look for text files?

Tags:

apache-spark

I thought that loading text files happens only on the workers / within the cluster (you just need to make sure all workers have access to the same path, either by having the text file available on every node or by using a shared folder mounted at the same path).

For example, spark-submit / spark-shell can be launched from anywhere and connect to a Spark master, and the machine where you launch spark-submit / spark-shell (which is also where the driver runs, unless you are in "cluster" deploy mode) has nothing to do with the cluster. Therefore any data loading should happen only on the workers, not on the driver machine, right? In other words, there should be no way that sc.textFile("file:///somePath") causes Spark to look for a file on the driver machine (again, the driver is external to the cluster, e.g. in "client" deploy mode / standalone mode), right?

Well, this is what I thought too...
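
To make that concrete, here is roughly the kind of call I mean (the hostname and path below are just placeholders); I assumed it should work as long as the file exists at that local path on every worker node, no matter which machine the shell is launched from:

// assumption: /tmp/data/myfile.csv exists locally on every worker node
// spark-shell launched on any machine with: --master spark://machineB:7077
scala> sc.textFile("file:///tmp/data/myfile.csv").count()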

Our cast

  • machine A: where the driver runs
  • machine B: where both the Spark master and one of the workers run

Act I - The Hope

When I start a spark-shell from machine B, pointing to the Spark master on B, I get this:

scala> sc.master
res3: String = spark://machineB:7077

scala> sc.textFile("/tmp/data/myfile.csv").count()
res4: Long = 976

Act II - The Conflict

But when I start a spark-shell from machine A, pointing to the Spark master on B, I get this:

scala> sc.master
res2: String = spark://machineB:7077

scala> sc.textFile("/tmp/data/myfile.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/data/myfile.csv

And indeed /tmp/data/myfile.csv does not exist on machine A, but machine A is not part of the cluster; it's just where the driver runs.

Act III - The Amazement

What's even weirder is that if I make the file available on machine A, it doesn't throw this error anymore. (Instead it creates a job with no tasks and then fails with a timeout, which is another issue that deserves a separate question.)

Is there something in the way Spark behaves that I'm missing? I thought that spark-shell, when connected to a remote master, has nothing to do with the machine it's running on. So why does the error stop once I make that file available on machine A? Does it mean that the path resolution of sc.textFile includes the machine where spark-shell or spark-submit was launched (in my case, also where the driver runs)? This makes zero sense to me, but again, I'm open to learning new things.

Epilogue

tl;dr - sc.textFile("file:/somePath") running from a driver on machine A against a cluster on machines B, C, D... (driver not part of the cluster)

It seems like it's looking for the path file:/somePath on the driver as well. Is that true (or is it just me)? Is that known? Is that by design?

I have a feeling this is some weird network / VPN topology issue unique to my workplace network, but this is still what happens to me, and I'm utterly confused about whether it's just me or known behavior. (Or I'm simply not getting how Spark works, which is always an option.)

asked Sep 08 '15 by Eran Medan


1 Answer

So the really short version of the answer is: if you reference "file://..." it should be accessible on all nodes in your cluster, including the driver program; while most of the work happens on the workers, some bits (such as computing the input splits) happen on the driver. Generally the way around this is simply not to use local files, and instead to use something like S3, HDFS, or another network filesystem. There is also the sc.addFile method, which can be used to distribute a file from the driver to all of the other nodes (you then use SparkFiles.get to resolve the download location).
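
A minimal sketch of the sc.addFile / SparkFiles.get approach (the path and file name below are just examples, not taken from the question):

import org.apache.spark.SparkFiles
import scala.io.Source

// ship a file that exists on the driver machine to every node in the cluster
sc.addFile("/tmp/data/myfile.csv")

// resolve the per-node download location inside a task and read the file there
val lineCount = sc.parallelize(Seq(1)).map { _ =>
  Source.fromFile(SparkFiles.get("myfile.csv")).getLines().size
}.first()

// or skip local files entirely and point at shared storage visible to all nodes:
// sc.textFile("hdfs:///data/myfile.csv").count()

SparkFiles.get is called inside the task so that it resolves the download directory on whichever node the task actually runs on.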

answered Oct 07 '22 by Holden