I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
Hi @sɐunıɔןɐqɐp, there is a Clone option at the top of the jobs dashboard in Dataflow. It clones the current job and lets you edit it and run it again.
You cannot delete a Dataflow job; you can only stop it. To stop a Dataflow job, you can use either the Google Cloud console, Cloud Shell, a local terminal installed with the Google Cloud CLI, or the Dataflow REST API.
To view the Job Logs generated by your pipeline code and the Dataflow service, in the Logs panel, click Show. You can filter the messages that appear in Job logs by clicking Info and Filter. To display only error messages, click Info and select Error.
Access to Dataflow is governed by Google service accounts. A service account is used by the Dataprep by Trifacta application to access services and resources in the Google Cloud Platform. A service account can be used by one or more users, who are accessing the platform.
The first resource that you should check is Dataflow documentation. It should be useful to check these:
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Do you have a setup.py file?
To debug this sort of issue I usually open StackDriver logging and look for worker-startup logs (see next figure). These logs are written by the worker as it starts up a Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
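import logging
import apache_beam as beam

# Pass the same pipeline options you use for the real job.
with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))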
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Using View.AsList for a side input.
GroupByKey operations.
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
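As a rough aid for finding the step that gets stuck most often, here is a minimal sketch that counts "Processing stuck" messages per step in exported log lines. The regular expression follows the message template quoted above; the sample lines, step names, durations, and states are made up for illustration, and the pattern assumes step names contain no spaces.

```python
import re
from collections import Counter

# Pattern follows the message template quoted above. The concrete values in
# SAMPLE_LOGS are hypothetical and only serve to exercise the function.
STUCK_RE = re.compile(
    r"Processing stuck in step (?P<step>\S+) for at least (?P<time>\S+) "
    r"without outputting or completing in state (?P<state>\S+)"
)


def count_stuck_steps(log_lines):
    """Count how many 'Processing stuck' messages mention each step."""
    counts = Counter()
    for line in log_lines:
        match = STUCK_RE.search(line)
        if match:
            counts[match.group("step")] += 1
    return counts


SAMPLE_LOGS = [
    "Processing stuck in step WriteToSink/ParDo/Write for at least 05m00s "
    "without outputting or completing in state process",
    "Worker started",
    "Processing stuck in step WriteToSink/ParDo/Write for at least 10m00s "
    "without outputting or completing in state process",
    "Processing stuck in step ReadInput/Parse for at least 05m00s "
    "without outputting or completing in state start",
]

stuck_counts = count_stuck_steps(SAMPLE_LOGS)
for step, n in stuck_counts.most_common():
    print(f"{step}: {n}")
```

The step with the highest count is a reasonable first place to inspect the user code.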
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to make asynchronous requests to external services, but I recommend that you commit / finalize work in the startBundle and finishBundle calls.
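A minimal sketch of that pattern, assuming a hypothetical send_to_service function as the external call. The class only mimics the bundle lifecycle (named start_bundle / process / finish_bundle in the Python SDK) and is kept as a plain class so it runs without Beam installed; in a real pipeline it would be a DoFn, and any output emitted from finish_bundle would have to follow Beam's rules for that method.

```python
from concurrent.futures import ThreadPoolExecutor


def send_to_service(element):
    """Hypothetical stand-in for a slow call to an external service."""
    return f"ack:{element}"


class AsyncWriteFn:
    """Mimics the start_bundle / process / finish_bundle lifecycle of a DoFn."""

    def start_bundle(self):
        # Fresh executor and pending list for every bundle.
        self._executor = ThreadPoolExecutor(max_workers=4)
        self._pending = []

    def process(self, element):
        # Submit the request without blocking on its result.
        self._pending.append(self._executor.submit(send_to_service, element))

    def finish_bundle(self):
        # Block until every request from this bundle has completed, so no
        # work is still in flight when the bundle is considered done.
        results = [future.result() for future in self._pending]
        self._executor.shutdown(wait=True)
        self._pending = []
        return results


fn = AsyncWriteFn()
fn.start_bundle()
for element in ["a", "b", "c"]:
    fn.process(element)
acks = fn.finish_bundle()
print(acks)
```

Waiting in finish_bundle means a failed request surfaces as an exception inside the bundle, rather than being lost after the bundle has finished.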
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, by sharding your existing keys into subkeys (Beam often does processing per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo.
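To make the subkey idea concrete, here is a minimal plain-Python sketch of a two-stage sum: first per (key, shard), then per original key. The helper names, the shard count of 8, and the sample data are all assumptions for illustration; in a pipeline each stage would be a per-key aggregation rather than a local loop.

```python
import random
from collections import defaultdict


def shard_key(key, num_shards, rng):
    """Turn one key into one of num_shards subkeys, e.g. ('hot_key', 3)."""
    return (key, rng.randrange(num_shards))


def sum_per_key(pairs):
    """Plain-Python stand-in for a per-key sum."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


rng = random.Random(0)  # seeded only to make the sketch repeatable
events = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 5

# Stage 1: aggregate per (key, shard). With 8 shards, the work for one hot
# key can be spread over up to 8 parallel units instead of 1.
sharded = [(shard_key(key, 8, rng), value) for key, value in events]
partial_sums = sum_per_key(sharded)

# Stage 2: drop the shard and combine the few partial results per key.
final_sums = sum_per_key(
    (key, value) for (key, _shard), value in partial_sums.items()
)
print(final_sums)
```

This only works for aggregations that can be combined in stages (sums, counts, maxima and similar), which is also why a Combiner tends to parallelize better than GroupByKey + ParDo.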
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
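One possible batching strategy, sketched in plain Python: group elements into fixed-size batches and make one external call per batch instead of one per element. The lookup_many function is a hypothetical service that accepts several keys per request, and the batch size of 2 is arbitrary.

```python
def batches(elements, batch_size):
    """Yield consecutive lists of at most batch_size elements."""
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


call_count = 0


def lookup_many(keys):
    """Hypothetical external service that accepts many keys per request."""
    global call_count
    call_count += 1
    return {key: key.upper() for key in keys}


results = {}
for batch in batches(["a", "b", "c", "d", "e"], batch_size=2):
    results.update(lookup_many(batch))

print(call_count, results)
```

Five elements cost three calls here instead of five. The Beam Python SDK also ships a BatchElements transform that can be used for the grouping step.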
In general, there's no silver bullet to improve your pipeline's throughput, and you'll need to experiment.
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermark is what drives the pipeline to make progress, so it is important to watch for things that could hold the watermark back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
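As an illustration of how a watermark can be held back, here is a deliberately simplified model. It assumes the watermark is the minimum of the latest event times seen across input partitions, which is a common textbook description and not the exact computation the Dataflow service performs. Partition names and timestamps are hypothetical.

```python
def watermark(latest_event_time_by_partition):
    """Simplified model: the watermark is the oldest 'latest seen' timestamp."""
    return min(latest_event_time_by_partition.values())


# Hypothetical partitions; timestamps are seconds since some start time.
partitions = {"p0": 100, "p1": 100, "p2": 100}

# p0 and p1 keep receiving data while p2 goes idle.
partitions["p0"] = 500
partitions["p1"] = 480
stalled = watermark(partitions)

# Once p2 reports progress again, the watermark can advance.
partitions["p2"] = 490
advanced = watermark(partitions)
print(stalled, advanced)
```

In this model a single input that stops reporting progress keeps the watermark at its old value, no matter how far the other inputs have advanced.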