
Custom Apache Beam Python version in Dataflow

I am wondering if it is possible to run a custom Apache Beam Python version in Google Dataflow, that is, a version that is not available in the public repositories (as of this writing: 0.6.0 and 2.0.0). For example, the HEAD version from the official Apache Beam repository, or a specific tag for that matter.

I am aware that custom packages (private local ones, for example) can be shipped along with a pipeline, as described in the official documentation. There are answered questions here on how to do this for other scripts, and there is even a gist that walks through it.

But I have not managed to get the current development version of Apache Beam (or a tagged one) from the master branch of its official repository packaged and sent along with my script to Google Dataflow. For example, for the latest available tag, whose link for pip to process would be git+https://github.com/apache/[email protected]#egg=apache_beam[gcp]&subdirectory=sdks/python, I get something like this:

INFO:root:Executing command: ['.../bin/python', '-m', 'pip', 'install', '--download', '/var/folders/nw/m_035l9d7f1dvdbd7rr271tcqkj80c/T/tmpJhCkp8', 'apache-beam==2.1.0', '--no-binary', ':all:', '--no-deps']
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting apache-beam==2.1.0
  Could not find a version that satisfies the requirement apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)
No matching distribution found for apache-beam==2.1.0

Any ideas? (I am wondering if it is even possible, since Google Dataflow may restrict the Apache Beam versions it can run to the officially released ones.)

Guille asked Jan 30 '23 21:01


1 Answer

I am answering my own question, since I got the answer through an Apache Beam JIRA issue I have been helping with.

If you want to use a custom Apache Beam Python version in Google Cloud Dataflow (that is, run your pipeline with --runner DataflowRunner), you must pass the option --sdk_location <apache_beam_v1.2.3.tar.gz> when you run your pipeline, where <apache_beam_v1.2.3.tar.gz> is the location of the packaged version that you want to use.

For example, as of this writing, if you have checked out the HEAD version of the Apache Beam git repository, you first have to package the SDK: navigate to the Python SDK with cd beam/sdks/python and then run python setup.py sdist (a compressed tar file will be created in the dist subdirectory).
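If you would rather script the packaging step, something along these lines might work. It is only a sketch: the helper names build_sdist and newest_tarball are mine, not part of Beam, and it assumes the checkout lives at beam/sdks/python and that setup.py sdist behaves as described above.

```python
import subprocess
import sys
from pathlib import Path


def newest_tarball(dist_dir):
    """Return the most recently modified *.tar.gz in dist_dir, or None."""
    tarballs = sorted(
        Path(dist_dir).glob("*.tar.gz"),
        key=lambda p: p.stat().st_mtime,
    )
    return tarballs[-1] if tarballs else None


def build_sdist(sdk_dir):
    """Run `python setup.py sdist` inside sdk_dir and return the tarball path.

    Hypothetical helper: it just wraps the manual steps from the answer.
    """
    sdk_dir = Path(sdk_dir)
    # Same command as typed by hand, but using the current interpreter.
    subprocess.run([sys.executable, "setup.py", "sdist"], cwd=sdk_dir, check=True)
    tarball = newest_tarball(sdk_dir / "dist")
    if tarball is None:
        raise FileNotFoundError("no .tar.gz found in %s" % (sdk_dir / "dist"))
    return tarball


# Example call (assumed checkout location, adjust as needed):
# sdk_path = build_sdist("beam/sdks/python")
```

Picking the newest file by modification time avoids hard-coding the version in the file name, which changes with every development snapshot.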

Thereafter you can run your pipeline like this:

python your_pipeline.py [...your_options...] --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz

Google Cloud Dataflow will use the supplied SDK.
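The same flags can also be assembled in code before the pipeline is constructed. The following is a minimal sketch under my own assumptions: with_custom_sdk is a hypothetical helper, not a Beam function, and the only Beam-specific parts are the --runner and --sdk_location flags mentioned above.

```python
import os


def with_custom_sdk(pipeline_args, sdk_tarball):
    """Return a copy of pipeline_args that targets Dataflow with a custom SDK.

    Hypothetical helper: it only appends the two flags discussed in the answer.
    """
    # Fail early if the tarball is missing, before anything is submitted.
    if not os.path.isfile(sdk_tarball):
        raise FileNotFoundError("SDK tarball not found: %s" % sdk_tarball)
    return list(pipeline_args) + [
        "--runner", "DataflowRunner",
        "--sdk_location", sdk_tarball,
    ]
```

The resulting list could then be passed to the pipeline the same way as the command-line arguments, for example through PipelineOptions. Checking that the file exists up front gives a clearer error than a failed job submission.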

Guille answered Feb 02 '23 10:02