
How to install private repository on Dataflow Worker?

We are running into issues when deploying Dataflow jobs.

The error

We use CustomCommands to install a private repository on the workers, but we now get an error in the worker-startup logs of our jobs:

Running command: ['pip', 'install', 'git+ssh://[email protected]/[email protected]']

Command output: b'Traceback (most recent call last):
File "/usr/local/bin/pip", line 6, in <module>
from pip._internal import main\nModuleNotFoundError: No module named \'pip\'\n' 

This code used to work, but it has been failing since our last deployment of the service on Friday.
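
The traceback shows the `/usr/local/bin/pip` wrapper script failing to import the `pip` module, which usually means the script and the interpreter it runs under no longer agree on where pip is installed. One thing that may be worth trying (an assumption, not something verified on a Dataflow worker) is to invoke pip as a module of the interpreter that runs `setup.py`, so that the wrapper script is bypassed. The helper name `pip_install_command` and the placeholder URL below are hypothetical.

```python
import sys


def pip_install_command(requirement):
    """Build a pip command that bypasses the `pip` wrapper script.

    `python -m pip` runs pip with the current interpreter, so it cannot
    pick up a wrapper script that belongs to a different Python install.
    """
    return [sys.executable, "-m", "pip", "install", requirement]


# Placeholder requirement; substitute the real git+ssh URL of the private repo.
command = pip_install_command("git+ssh://<host>/<org>/<repo>.git")
```

The resulting list could replace the plain `["pip", "install", ...]` entry in the list of custom commands.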

Some context

  1. We use a GAE service with a cron job to deploy Dataflow jobs, using the Python SDK.
  2. Our jobs use code stored in a private repository.
  3. To allow the workers to pull private repositories, we use a setup.py with CustomCommands, which are run during worker startup (code example from the official repo here).
  4. The commands retrieve an encrypted SSH key from GCS, decrypt it with KMS, fetch an SSH config file that specifies the path of the key and the allowed hosts, then perform a pip install git+ssh://[email protected]/[email protected] (see commands below).

    # Retrieve the SSH key
    ["gsutil", "cp", "gs://{bucket_name}/encrypted_python_repo_ssh_key".format(bucket_name=credentials_bucket), "encrypted_key"],
    ["chmod", "700", "decrypted_key"],
    # Install git & ssh
    ["apt-get", "update"],
    ["apt-get", "install", "-y", "openssh-server"],
    ["apt-get", "install", "-y", "git"],

    # Add an SSH config which specifies the location of the key & the host
        "git+ssh://[email protected]/[email protected]",

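For reference, here is a minimal sketch of how a list of commands like the one above could be executed one by one, stopping at the first failure and surfacing its output. It is loosely modelled on the CustomCommands pattern from the Beam examples, but it is not the original setup.py; `run_custom_commands` is a hypothetical name and the command in the usage line is a placeholder.

```python
import subprocess
import sys


def run_custom_commands(commands):
    """Run each command in order and return the captured outputs.

    Raises RuntimeError on the first command that exits with a non-zero
    status, including its combined stdout/stderr in the message.
    """
    outputs = []
    for command in commands:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(
                "Command %s failed with exit code %d:\n%s"
                % (command, result.returncode, result.stdout)
            )
        outputs.append(result.stdout)
    return outputs


# Placeholder command; in setup.py this would be the list of custom commands.
outputs = run_custom_commands([[sys.executable, "-c", "print('ok')"]])
```

Failing fast with the full command output makes it easier to see in the worker-startup logs which step broke.
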
What we tried

  • According to pip issue #5599, there seems to be a conflict between several versions of pip. We tried reinstalling it by adding apt-get --reinstall install -y python-setuptools python-wheel python-pip (and variations such as curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py --force-reinstall) to the CustomCommands, but saw no improvement.

Notes:

  • Jobs started locally work. (How? I am curious how this can work, since the CustomCommands are not run.)
  • Logging in to the Compute instance, attaching to the Docker process and running the commands manually does not produce any error.
  • The service is deployed using a custom Dockerfile, defined by the following snippet:

FROM gcr.io/google-appengine/python
RUN apt-get update && apt-get install -y openssh-server
RUN virtualenv /env -p python3.7

# Setting these environment variables is the same as running
# source /env/bin/activate.
ENV PATH /env/bin:$PATH

# Set credentials for git, then run pip to install all
# dependencies into the virtualenv.
... specify SSH KEY, host, to allow private git repo pull 

# Add the application source code.
ADD . /app
RUN pip install -r /app/requirements.txt && python /app/setup.py install && python /app/setup.py build
CMD gunicorn -b :$PORT main:app

Any idea how to solve this issue, or is there any workaround available?

Thanks for your help!


This seems to be mostly due to the local state of the machine, or of our computers.

After running some commands like python setup.py install or python setup.py build, I am no longer able to deploy jobs (I get the same worker-startup error as the jobs deployed by the service), but my colleague can still deploy jobs that run correctly (same code, same branch, except for directories excluded by .gitignore such as build, dist, ...). In his case, the CustomCommands are not run on job deployment (but the workers are still able to use the locally packaged pipeline).

Is there any way to specify a compiled package for the workers to use? I was not able to find documentation on that.


As we were not able to pull private code from the Dataflow workers, we used the following workaround:

  • Build a wheel of our private repo using python setup.py sdist bdist_wheel
  • Embed this wheel in our Dataflow package under lib/my-package-1.0.0-py3-none-any.whl
  • Pass the wheel to the Dataflow options as an extra package (see Beam code here)
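
Commands used:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
# Ship the locally built wheel to the workers as an extra package.
pipeline_options.view_as(SetupOptions).extra_packages = ["./lib/my-package-1.0.0-py3-none-any.whl"]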
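
To avoid hard-coding the wheel's version in the pipeline code, the list passed to extra_packages could be built by scanning the lib/ directory. This is only a sketch: `find_wheels` is a hypothetical helper, and the lib directory name is taken from the layout described in the workaround.

```python
from pathlib import Path


def find_wheels(lib_dir="lib"):
    """Return the paths of all wheel files directly under lib_dir, sorted."""
    return sorted(str(path) for path in Path(lib_dir).glob("*.whl"))


# Usage with Beam (requires apache_beam, so it is shown as a comment only):
# pipeline_options.view_as(SetupOptions).extra_packages = find_wheels("./lib")
```

With this, rebuilding the wheel under a new version number does not require touching the pipeline options.
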
Colin Le Nost asked Jan 06 '20 16:01

1 Answer

For anything but trivial, public dependencies, I would recommend using custom containers and installing all the dependencies ahead of time.
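
As a rough illustration of this suggestion: recent Beam Python SDK releases accept an `--sdk_container_image` pipeline option that points the workers at a prebuilt image. The sketch below only assembles the command-line flags; the project, region and image name are placeholders, and `dataflow_flags` is a hypothetical helper. It is worth checking which option name the SDK version in use expects, since older releases used a different flag.

```python
def dataflow_flags(project, region, container_image):
    """Assemble pipeline flags for a Dataflow job using a custom container."""
    return [
        "--runner=DataflowRunner",
        "--project=%s" % project,
        "--region=%s" % region,
        # Image with the private dependencies already installed.
        "--sdk_container_image=%s" % container_image,
    ]


# Placeholder values; the list could then be passed to PipelineOptions(flags).
flags = dataflow_flags(
    "my-project", "europe-west1", "gcr.io/my-project/beam-worker:latest"
)
```

Because the private package is baked into the image at build time, the workers would no longer need SSH keys or a pip install at startup.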

robertwb answered Oct 15 '22 14:10
