I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?
PySpark is the Python API for Apache Spark: it lets Python programs create and work with Resilient Distributed Datasets (RDDs) and the rest of Spark.
Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll go back to manual dependency management.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:

- Each time you run a Spark job, run a fresh `pip install` of all your own in-house Python libraries into a dedicated virtualenv. If you have set these up with setuptools, this will install their dependencies.
- Zip up the virtualenv's `site-packages` directory.
- Pass the resulting `.zip` file, containing your libraries and their dependencies, as the argument to `--py-files`.
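The zip step can also be done from Python rather than shell. A minimal sketch, assuming the same `python2.7` virtualenv layout used by the helper script below (the function name is my own):

```python
import zipfile
from pathlib import Path

def zip_site_packages(venv_dir, zip_path, py_version="python2.7"):
    """Zip the virtualenv's site-packages so it can be shipped via --py-files."""
    site = Path(venv_dir) / "lib" / py_version / "site-packages"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in site.rglob("*"):
            if path.is_file():
                # Archive names must be relative to site-packages so imports resolve.
                zf.write(path, path.relative_to(site))
    return zip_path
```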
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
```bash
#!/usr/bin/env bash
# Helper script to fulfil Spark's python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the
# virtualenv for use as the value of the --py-files argument of
# `pyspark` or `spark-submit`.
#
# First argument should be the top-level virtualenv.
# Second argument is the zipfile which will be created, and
# which you can subsequently supply as the --py-files argument to
# spark-submit.
# Subsequent arguments are all the private packages you wish to install.
# If these are set up with setuptools, their dependencies will be installed.

VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*

. "$VENV/bin/activate"
for pkg in $PACKAGES; do
  pip install --upgrade "$pkg"
done

# Absolute path; use a random name to avoid clashes with other processes.
TMPZIP="$TMPDIR/$RANDOM.zip"
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"
```
I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the script's comments). Because I always run these scripts, I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
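As a sketch of what such a wrapper might look like, here is a hypothetical Python helper that assembles the spark-submit command line; the file names (`deps.zip`, `my_job.py`) are illustrative, not part of the original setup:

```python
import subprocess

def build_submit_command(deps_zip, job_script, master="yarn", extra_args=()):
    """Assemble the spark-submit argv, attaching the dependency zip via --py-files."""
    cmd = ["spark-submit", "--master", master, "--py-files", deps_zip, job_script]
    cmd.extend(extra_args)
    return cmd

# The zip produced by the helper script is passed straight through:
cmd = build_submit_command("deps.zip", "my_job.py")
# cmd is ["spark-submit", "--master", "yarn", "--py-files", "deps.zip", "my_job.py"]
# A wrapper would then run it, e.g.: subprocess.check_call(cmd)
```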
There are loads of improvements that could be made – e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing your often-changing private packages and one containing your rarely changing dependencies, which wouldn't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip, and checking the validity of the arguments would also be a good idea. For now, though, this suffices for my purposes.
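One way to implement the "check for file changes before rebuilding" idea is to hash the package source tree and rebuild only when the digest changes. A minimal sketch (the function names and stamp-file scheme are my own, not part of the original pipeline):

```python
import hashlib
from pathlib import Path

def tree_digest(root):
    """Hash every file under root (path + contents) so any change alters the digest."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def needs_rebuild(root, stamp_file):
    """Return True (and record the new digest) when the tree changed since last build."""
    digest = tree_digest(root)
    stamp = Path(stamp_file)
    if stamp.exists() and stamp.read_text() == digest:
        return False
    stamp.write_text(digest)
    return True
```

A wrapper script could call `needs_rebuild` on each private package directory and skip the zip step when nothing changed.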
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.