--files option in pyspark not working

Tags: apache-spark, pyspark, hadoop-yarn

I tried the sc.addFile option (which works without any issues) and the --files option from the command line (which fails).

Run 1 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
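# collect() returns the results to the driver; an RDD itself is not iterable, so sorted() needs a list
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())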

external package: external_package.py

class external(object):
    def __init__(self):
        pass
    def fun(self,input):
        return input*2

readme.txt

MY TEXT HERE

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files "/path to local codelib/external_package.py" \
  /local-pgm-path/spark_distro.py  \
  1000

Output: Working as expected

['MY TEXT HERE']

But if I pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.

Run 2 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
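# collect() returns the results to the driver; an RDD itself is not iterable, so sorted() needs a list
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())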

external_package.py: same as above

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files "/path to local codelib/external_package.py" \
  --files /local-path/readme.txt#readme.txt  \
  /local-pgm-path/spark_distro.py  \
  1000

Output:

Traceback (most recent call last):
  File "/local-pgm-path/spark_distro.py", line 31, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'

Are sc.addFile and --files used for the same purpose? Can someone please share their thoughts?


asked Nov 08 '17 18:11

goks

1 Answer

I have finally figured out the issue, and it is a very subtle one indeed.

As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation; compare "on every node" with "of each executor":

addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.

In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.

Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):

test_fail.py:

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:  
    lines = [line.strip() for line in test_file]
print(lines)

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_fail.py

Result:

[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
  File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'

In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script so that access is requested by the executors instead (test_success.py):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_success.py

Result:

[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']

Notice also that here we don't need SparkFiles.get - the file is readily accessible.
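When the file has to be opened with ordinary Python code inside a task (rather than through sc.textFile), one possible approach is sketched below. This is not one of the scripts tested above: the helper names strip_lines, read_shipped_file and lines_seen_by_executors are hypothetical, and the sketch assumes that on YARN SparkFiles.get, when called inside a task, resolves to the executor's working directory where --files placed the file. Nothing here touches Spark until lines_seen_by_executors is called with a live SparkContext.

```python
def strip_lines(lines):
    """Strip surrounding whitespace (including newlines) from each line."""
    return [line.strip() for line in lines]


def read_shipped_file(partition, name="readme.txt"):
    """Meant to run inside a task, i.e. on an executor, not on the driver."""
    # Imported lazily so the lookup happens in the executor's Python worker.
    from pyspark import SparkFiles

    # Assumption: on the executor this path points into the container's
    # working directory, which is where --files places the file.
    with open(SparkFiles.get(name)) as shipped:
        yield strip_lines(shipped)


def lines_seen_by_executors(sc, name="readme.txt"):
    """Driver-side helper: asks a single task to read the file and return it."""
    rdd = sc.parallelize([0], 1)  # one dummy partition, so one task
    return rdd.mapPartitions(lambda part: read_shipped_file(part, name)).collect()[0]
```

In a script submitted with --files /home/ctsats/readme.txt, calling lines_seen_by_executors(sc) after creating sc would be the executor-side counterpart of the failing driver-side open() in test_fail.py.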

As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
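For completeness, a minimal sketch of what such an sc.addFile check could look like, reading the same file once on the driver and once inside a task. It is an illustration under assumptions, not the exact script I ran: shipped_name and read_on_driver_and_executor are hypothetical helper names, and the function needs a live SparkContext to do anything.

```python
import os


def shipped_name(local_path):
    """Files added with sc.addFile are looked up by their base name."""
    return os.path.basename(local_path)


def read_on_driver_and_executor(sc, local_path):
    """Add a file with sc.addFile and read it from both sides."""
    sc.addFile(local_path)
    name = shipped_name(local_path)

    def read(_partition=None):
        # Imported lazily so the same function works in the driver process
        # and inside the executor's Python worker.
        from pyspark import SparkFiles

        with open(SparkFiles.get(name)) as shipped:
            # Wrapped in a list so mapPartitions gets an iterable with one item.
            return [[line.strip() for line in shipped]]

    driver_lines = read()[0]
    executor_lines = sc.parallelize([0], 1).mapPartitions(read).collect()[0]
    return driver_lines, executor_lines
```

With the readme.txt from the question, both returned lists would be expected to hold the same single line.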

Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
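To illustrate the ordering rule, here is a small sketch that assembles a spark-submit command line so that every Spark option comes before the script and every application argument comes after it. The function names build_spark_submit and as_shell are hypothetical helpers, not part of Spark; the flags are the ones already used in this thread.

```python
import shlex


def build_spark_submit(script, app_args=(), files=(), py_files=(),
                       master="yarn", deploy_mode="client"):
    """Return an argv list with Spark options first, then script, then app args."""
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        argv += ["--py-files", ",".join(py_files)]  # comma-separated list
    if files:
        argv += ["--files", ",".join(files)]  # comma-separated list
    argv.append(script)  # everything after this goes to the script itself
    argv += [str(arg) for arg in app_args]
    return argv


def as_shell(argv):
    """Render the argv list as a shell command, quoting paths with spaces."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, as_shell(build_spark_submit("/home/ctsats/scripts/SO/test_fail.py", files=["/home/ctsats/readme.txt"])) produces the same options as the test command above, on a single line.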

Tested with both Spark 1.6.0 & 2.2.0.

UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:

$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020

But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:

Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no references to its use online - probably nobody uses it in practice, and with good reason.

(Notice that I am not talking about --py-files, which serves a different, legitimate role.)

Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
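One way to keep this unambiguous in code is to always pass sc.textFile a path with an explicit scheme, since (as far as I understand the Hadoop path rules) a path without a scheme is resolved against fs.defaultFS. Below is a minimal sketch; has_scheme and qualify are hypothetical helpers, and the hdfs:/// form assumes the NameNode is taken from the cluster configuration.

```python
from urllib.parse import urlparse


def has_scheme(path):
    """True for 'hdfs://...' or 'file:///...', False for plain paths."""
    return bool(urlparse(path).scheme)


def qualify(path, scheme="hdfs"):
    """Return path with an explicit scheme, leaving qualified paths untouched."""
    if has_scheme(path):
        return path
    if not path.startswith("/"):
        # A relative path depends on a working directory; refuse to guess it.
        raise ValueError("expected an absolute path, got: {!r}".format(path))
    return "{}://{}".format(scheme, path)
```

With this, sc.textFile(qualify("/user/ctsats/readme.txt")) would read from HDFS, while qualify("/home/ctsats/readme.txt", scheme="file") makes a local-FS read explicit for toy examples.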

answered Oct 12 '22 20:10

desertnaut
