 

How do you test Beam pipeline (Google Dataflow) in Python?

I am having trouble understanding how we are supposed to test our pipeline using the Google Dataflow (based on Apache Beam) Python SDK.

https://beam.apache.org/documentation/pipelines/test-your-pipeline/
https://cloud.google.com/dataflow/pipelines/creating-a-pipeline-beam

The above links are only for Java. I am pretty confused as to why Google would point to Java documentation for testing Apache Beam pipelines.

I want to be able to view the results of a CoGroupByKey join on two PCollections. I am coming from a Python background, and I have little to no experience using Beam/Dataflow.

I could really use any help. I know this is open-ended to an extent; basically I need a way to view results from within my pipeline, and right now I cannot see the output of my CoGroupByKey join.

Code Below

    # dwsku and product are PCollections coming from BigQuery. product
    # contains nested values as well, but dwsku does not.
    d1 = {'dwsku': dwsku, 'product': product}
    results = d1 | beam.CoGroupByKey()
    print(results)

What is printed:

    PCollection[CoGroupByKey/Map(_merge_tagged_vals_under_key).None]
asked Nov 21 '17 by codebrotherone

People also ask

How do you test a Dataflow pipeline?

Testing a pipeline end-to-end: for every source of input data to your pipeline, create some known static test input data. Create some static test output data that matches what you expect in your pipeline's final output PCollection(s). Then create a TestPipeline in place of the standard Pipeline.create.
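The Python SDK has direct equivalents of the Java testing utilities: TestPipeline, assert_that, and equal_to live under apache_beam.testing. Below is a minimal sketch of an end-to-end test of a CoGroupByKey join; the input data, labels, and the normalize helper are made up for illustration:

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    def normalize(element):
        # Turn (key, {'dwsku': [...], 'product': [...]}) into a
        # deterministic, sortable tuple so equal_to can compare it.
        key, grouped = element
        return (key, sorted(grouped['dwsku']), sorted(grouped['product']))

    def test_cogroupbykey_join():
        with TestPipeline() as p:
            # Known static test input data.
            dwsku = p | 'CreateDwsku' >> beam.Create([('sku1', 'a'), ('sku2', 'b')])
            product = p | 'CreateProduct' >> beam.Create([('sku1', 'x')])
            results = ({'dwsku': dwsku, 'product': product}
                       | beam.CoGroupByKey()
                       | beam.Map(normalize))
            # Static expected output for the final PCollection.
            assert_that(results, equal_to([
                ('sku1', ['a'], ['x']),
                ('sku2', ['b'], []),
            ]))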

How do you run a Dataflow pipeline in GCP?

You can run your job on managed Google Cloud resources by using the Dataflow runner service. Running your pipeline with Dataflow creates a Dataflow job, which uses Compute Engine and Cloud Storage resources in your Google Cloud project. Note: Typing Ctrl+C from the command line does not cancel your job.
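As a hedged sketch of what that looks like from the Python SDK (the project, region, and bucket names below are hypothetical placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # All resource names below are made-up placeholders.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',
        '--region=us-central1',
        '--temp_location=gs://my-bucket/tmp',
    ])
    p = beam.Pipeline(options=options)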

How do you run a Dataflow pipeline?

If you are looking for a step-by-step guide on how to create and deploy your first pipeline, use Dataflow's quickstarts for Java, Python, Go, or templates. After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it.


1 Answer

If you want to test it locally on your machine, start with the DirectRunner; then you will be able to debug the pipeline, either by printing logs or by stopping execution in a debugger.
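For example (a minimal sketch; the DirectRunner is also what Beam defaults to when no runner is specified):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Force local execution with the DirectRunner.
    options = PipelineOptions(['--runner=DirectRunner'])
    p = beam.Pipeline(options=options)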

In order to see the whole PCollection locally, you can do the following:

    d1 = {'dwsku': dwsku, 'product': product}
    results = d1 | beam.CoGroupByKey()

    def my_debug_function(pcollection_as_list):
        # Add a breakpoint in this function, or just print the list.
        print(pcollection_as_list)

    debug = (results | beam.combiners.ToList() | beam.Map(my_debug_function))

There are a few things to remember here:

  • the ToList() transform can potentially allocate a lot of memory, since it materializes the entire PCollection in one place
  • while using the DirectRunner, call the .wait_until_finish() method on your pipeline's run result, so that your script does not end before the pipeline finishes executing (see the sketch after this list)
  • if your pipeline reads data from BigQuery, put a LIMIT clause in the query when running locally
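Putting the last two points together, here is a hedged sketch of a local run; the BigQuery table name and query are hypothetical, and newer SDKs expose beam.io.ReadFromBigQuery in place of BigQuerySource:

    import apache_beam as beam

    p = beam.Pipeline()  # DirectRunner by default when run locally

    # Hypothetical read: the LIMIT keeps the local download small.
    dwsku = p | 'ReadDwsku' >> beam.io.Read(beam.io.BigQuerySource(
        query='SELECT sku, value FROM `my_project.my_dataset.dwsku` LIMIT 1000',
        use_standard_sql=True))

    # ... build the rest of the pipeline (CoGroupByKey, debug transforms) ...

    result = p.run()
    result.wait_until_finish()  # block until the local run completes

Wrapping the pipeline in a `with beam.Pipeline() as p:` block has the same effect, since the context manager runs the pipeline and waits for completion on exit.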
answered Sep 27 '22 by Marcin Zablocki