I had a Cloud Dataflow pipeline fail after about 14 worker-hours with the following cryptic log message:
Mar 29, 2016, 8:18:16 PM (3253bcfbb8c9c2a7): Workflow failed. Causes: (2bfe8449fe3ba464): S745 (STAGE REDACTED) Causes: (1a6d5387c382ba3a): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: (WORKERS REDACTED)
I glanced quickly at worker logs and it wasn't immediately obvious what was happening either. Is there supposed to be something to those cause codes?
The troubleshooting guide wasn't particularly elucidating here, either. My best guess was that it fell under the "shuffle-bound" category (the job is very shuffle-intensive), but none of the errors listed there are present in the logs.
Thanks!
Test the Pipeline: For every source of input data to your pipeline, create some known static test input data. Create some static test output data that matches what you expect in your pipeline's final output PCollection(s). Create a TestPipeline in place of the standard Pipeline.create.
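As an illustration, here is a minimal sketch of that pattern using the Apache Beam Java SDK (the successor to the Dataflow SDK) with JUnit; the class name and the transform under test are just placeholders:

```java
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class WordCountTest {
  // TestPipeline stands in for Pipeline.create() in unit tests.
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void countsStaticInput() {
    // Known static test input data.
    PCollection<String> input = p.apply(Create.of("a", "b", "a"));

    // The transform under test (here: a simple global element count).
    PCollection<Long> output = input.apply(Count.globally());

    // Assert against the expected static output.
    PAssert.that(output).containsInAnyOrder(3L);

    p.run().waitUntilFinish();
  }
}
```

With the direct runner on the classpath, this executes locally and fails the test if the actual output ever diverges from the expected static output.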
Data moves from one component to the next through a series of pipes, flowing through each pipe from left to right. A "pipeline" is the series of pipes that connects these components into a single end-to-end process.
You can launch a streaming Dataflow job from this template via the REST API, using a unique job name to ensure that only one instance of the job is running at any point in time. If the job were cancelled, you could restart it by issuing the same request again.
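For reference, a minimal sketch of that templates.launch call in Java; the project, region, template path, job name, and access-token handling are all placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LaunchTemplate {
  public static void main(String[] args) throws Exception {
    // Placeholders: project, region, template path, and an OAuth2 access token
    // (e.g. obtained via `gcloud auth print-access-token`).
    String project = "my-project";
    String region = "us-central1";
    String gcsTemplatePath = "gs://my-bucket/templates/my-streaming-template";
    String accessToken = "ACCESS_TOKEN";

    // templates.launch endpoint of the Dataflow REST API.
    URL url = new URL(String.format(
        "https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/templates:launch?gcsPath=%s",
        project, region, gcsTemplatePath));

    // A fixed jobName means the service rejects a second launch while a job
    // with that name is still active, so at most one instance runs at a time.
    String body = "{\"jobName\": \"my-streaming-job\", \"parameters\": {}}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Authorization", "Bearer " + accessToken);
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```

Rerunning the same request after the job has been cancelled or drained simply starts a fresh instance under the same name.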
These are two different tools that solve different problems: Dataflow lets you build scalable data processing pipelines (batch and streaming), while Composer is used to schedule, orchestrate, and manage those pipelines.
I looked up your job by the error IDs, and it seems the work items were repeatedly failing due to out-of-memory errors: the Java process was killed by the OOM killer, unfortunately without getting a chance to write a heap dump. Search for "oom-killer" in the Cloud logs to find the relevant entries.
Unfortunately, all I can suggest with this information is to use a bigger instance type or to optimize the memory usage of your transforms (e.g., make sure they aren't buffering a lot of data in memory).
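For the first suggestion, here is a minimal sketch of requesting larger workers, assuming the Apache Beam/Dataflow Java SDK's DataflowPipelineOptions (the machine type shown is only an example; the same effect can be achieved with the --workerMachineType command-line flag):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BiggerWorkers {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Request a high-memory machine type so each worker JVM has more headroom.
    options.setWorkerMachineType("n1-highmem-4");

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline as usual, then p.run();
  }
}
```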