Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

--files option in pyspark not working

I tried sc.addFile option (working without any issues) and --files option from the command line (failed).

Run 1 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x:import_my_special_package(x)))

external package: external_package.py

class external(object):
    def __init__(self):
        pass
    def fun(self,input):
        return input*2

readme.txt

MY TEXT HERE

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  /local-pgm-path/spark_distro.py  \
  1000

Output: Working as expected

['MY TEXT HERE']

But if i try to pass the file(readme.txt) from command line using --files (instead of sc.addFile)option it is failing. Like below.

Run 2 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x: import_my_special_package(x)))

external_package.py Same as above

spark submit

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  --files /local-path/readme.txt#readme.txt  \
  /local-pgm-path/spark_distro.py  \
  1000

Output:

Traceback (most recent call last):
  File "/local-pgm-path/spark_distro.py", line 31, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'

Is sc.addFile and --file used for same purpose? Can someone please share your thoughts.

like image 542
goks Avatar asked Nov 08 '17 18:11

goks


People also ask

How do I add files to PySpark?

In Apache Spark, you can upload your files using sc. addFile (sc is your default SparkContext) and get the path on a worker using SparkFiles. get. Thus, SparkFiles resolve the paths to files added through SparkContext.

How do I access Spark files?

To access the file in Spark jobs, use SparkFiles. get(fileName) to find its download location. A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.

What is SparkFiles in PySpark?

PySpark provides the facility to upload your files using sc. addFile. We can also get the path of working directory using SparkFiles. get.

Can a driver program read local files of the node where we submit the Spark Program?

Show activity on this post. Yes, you can access files uploaded via the --files argument.


1 Answers

I have finally figured out the issue, and it is a very subtle one indeed.

As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at the documentation (emphasis added):

addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.

In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.

Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):

test_fail.py:

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:  
    lines = [line.strip() for line in test_file]
print(lines)

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_fail.py

Result:

[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
  File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'

In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_success.py

Result:

[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']

Notice also that here we don't need SparkFiles.get - the file is readily accessible.

As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).

Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).

Tested with both Spark 1.6.0 & 2.2.0.

UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:

$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020

But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:

Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no use references online - probably nobody uses it in practice, and with good reason.

(Notice that I am not talking for --py-files, which serves a different, legitimate role.)

Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all files to be processed into the HDFS already - period. The "natural" place for files to be processed by Spark is the HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if you want some time in the future to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...

like image 74
desertnaut Avatar answered Oct 12 '22 20:10

desertnaut