
How to get Apache Beam for Dataflow on GCP running on Python 3.x

I'm very new to GCP and Dataflow. However, I would like to start testing and deploying a few flows using Dataflow on GCP. According to the documentation and everything else I have read about Dataflow, it is imperative to use the Apache Beam project. Following the official documentation here, the supported version of Python is 2.7.

Honestly, this is fairly disappointing, because Python 2.x is losing official support and everybody is working with version 3.x. Nevertheless, I would like to know if anyone knows how to get Beam and GCP Dataflow running on Python 3.x.

I saw this video, and somehow this person reached this wonderful milestone; apparently it runs on Python 3.5.

Update:

Guys, I just want to share a thought that has crossed my mind while struggling with Dataflow. I feel really disappointed by how challenging it is to get hands-on with this tool, in either the Java or the Python version. On the Python side, there are constraints around version 3, which is pretty much the current standard. On the other hand, Java has issues running on version 11; I had to tweak my code a bit to run on version 8, and then I started struggling with many incompatibilities in the code. In short, if GCP really wants to move forward and become #1, there is a lot to improve.

Workaround:

I downgraded my Java version to JDK 8 and installed Maven, and now my Eclipse setup works with Apache Beam.

I finally solved it, but GCP, please really consider enhancing and expanding support for the most recent versions of Java and Python.

Thanks so much

Andres Urrego Angel asked Jan 24 '19 04:01



2 Answers

You can now run Apache Beam on Python 3.5 with apache-beam==2.11.0 (I tried both the Direct runner and the Dataflow runner).

When running, it comes with a warning:

UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully supported. You may encounter buggy behavior or missing features.
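For anyone who wants to reproduce this, here is a minimal sketch of a smoke test on the local Direct runner. It assumes apache-beam is installed in the active Python 3 environment; the names `normalize` and `run` are made up for illustration and are not part of Beam.

```python
def normalize(word):
    """Pure per-element function applied inside the pipeline (hypothetical helper)."""
    return word.strip().lower()


def run(words):
    """Build and run a tiny pipeline; with no options Beam uses the Direct runner."""
    # Imported here so this module still loads where apache-beam is not installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(words)      # in-memory input collection
            | "Normalize" >> beam.Map(normalize)  # apply the helper to each element
            | "Print" >> beam.Map(print)          # print each element locally
        )
```

Calling `run([" Beam ", "Dataflow"])` under Python 3 should show the warning above if your SDK version still emits it, followed by the normalized words. Element order is not guaranteed.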

I have already noticed that beam.io.gcp.pubsub.ReadFromPubSub() is broken: I push messages to Pub/Sub, but the pipeline never reads them (tried on the Direct runner).

Hopefully things will improve with time.

Vibhor Jain answered Oct 06 '22 00:10


See @VibhorJain's answer; it is working now.


Currently there is NO way to use Python 3 for apache-beam (you could write an adapter for it, but that would surely be pointless).

Python 3.x support is in progress; please take a look at this apache-beam issue.

P.S. In the video, Python 3.5.2 is ONLY the editor's version; it is not the Python running apache-beam. Note that in the bash shell, Python 2.7 is running.
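One way to avoid this kind of confusion is to have the pipeline script itself report which interpreter is executing it, instead of trusting the editor's status bar. A small sketch using only the standard library follows; `describe_interpreter` is a hypothetical helper name, not something from Beam.

```python
import sys


def describe_interpreter():
    """Report the interpreter that is actually executing this script."""
    major, minor = sys.version_info[:2]
    return {
        "executable": sys.executable,              # path of the running interpreter
        "version": "{}.{}".format(major, minor),   # e.g. "2.7" or "3.5"
        "is_python3": major >= 3,
    }
```

Printing `describe_interpreter()` at the top of the pipeline script, from the same shell you launch it in, shows whether Python 2.7 or 3.x is really in use.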

MT-FreeHK answered Oct 06 '22 00:10