We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:
A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.
However, we could not find any detailed worker-startup logs.
We tried increasing the memory size, worker count, etc., but we still get the same error.
Here is the command we use:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2
Pipeline snippet:
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)
The above pipeline just loads data from BigQuery and filters it on a column value. It works like a charm with DirectRunner but fails on Dataflow.
Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving this issue.
Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job will start but eventually fail, because it cannot find the package on the workers. The Beam Python SDK sometimes does not throw an explicit error message in these cases; instead, it retries the job and then fails. To get your code running as a package, you need to pass the --setup_file argument, pointing to a setup.py that lists your dependencies. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
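As a rough illustration, a minimal setup.py might look like the sketch below; the package name "my_pipeline" and the dependency list are placeholders, not from the original post:
# setup.py (placed at the project root, next to run.py)
import setuptools

setuptools.setup(
    name="my_pipeline",  # hypothetical package name
    version="0.0.1",
    packages=setuptools.find_packages(),  # picks up all sub-packages with __init__.py files
    install_requires=[
        "apache-beam[gcp]",  # example dependency; list whatever your pipeline imports
    ],
)
It can help to run python setup.py sdist locally and list the contents of the generated tarball to confirm everything your pipeline needs is included before submitting the job.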
If you have a privately hosted Python package dependency, pass --extra_package with the path to the package's .tar.gz file. A better approach is to store the tarball in a GCS bucket and pass that path here.
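For example, assuming a hypothetical private dependency tarball uploaded to your bucket (per the GCS suggestion above), you would add one more flag to the same run command:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--setup_file=./setup.py \
--extra_package=gs://xyz/packages/private_dependency-1.0.0.tar.gz \
--worker_machine_type n1-standard-8 \
--num_workers 2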
I have written an example project to help you get started with the Apache Beam Python SDK on Dataflow: https://github.com/RajeshHegde/apache-beam-example
You can read about it here: https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366