
Google Cloud Dataflow stuck with repeated error 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'

SDK: Apache Beam SDK for Go 0.5.0

Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version seems to be the same as it has been. Last night it failed, and I'm not sure exactly why. It reaches the 1 hour time limit and the job is then cancelled due to no worker activity.

Looking at the Stackdriver logs, the only thing that stands out is repeated errors of the form 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'.

It seems that it's somehow failing to sync the pod(?) and thus waiting 5 minutes before retrying.

Could anyone shed some light on what might be causing this and how we might go about either finding more information, or diagnosing the cause of the problem?

Note: I checked the status page for Google Cloud Dataflow and there don't appear to be any outages with the service.

Tim asked Dec 12 '18 02:12


1 Answer

We had something similar and found that it was an inability to start the workers (for us due to an slf4j issue, but it could be anything that prevents the worker from starting, in whatever language).

If you look at the Stackdriver Logs (view Logs in the UI, and click the link to go to Stackdriver) you should be able to view the worker_startup logs.
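If you would rather query those logs directly than click through the UI, something like the following might help. It is only a sketch: it builds a Stackdriver (Cloud Logging) filter string for one job and prints a matching gcloud logging read command. The function name workerStartupFilter and the job ID are made up for illustration, and matching on the substring "worker-startup" in the log name is an assumption about how the startup log is named in your project, so adjust it to whatever you actually see in the log viewer.

```go
package main

import (
	"fmt"
	"strings"
)

// workerStartupFilter builds a Cloud Logging filter that selects the
// worker startup logs of a single Dataflow job.
// Assumption: the startup log's name contains "worker-startup".
func workerStartupFilter(jobID string) string {
	clauses := []string{
		// Dataflow worker logs are attached to the dataflow_step resource.
		`resource.type="dataflow_step"`,
		// Restrict the results to the job that is stuck.
		fmt.Sprintf(`resource.labels.job_id=%q`, jobID),
		// ":" is the "has" (substring) operator in the filter language.
		`logName:"worker-startup"`,
	}
	return strings.Join(clauses, " AND ")
}

func main() {
	// Hypothetical job ID; use the one shown on the job's page.
	filter := workerStartupFilter("2018-12-11_18_00_00-1234567890")

	// Paste the filter into the log viewer, or run the printed command.
	fmt.Println(filter)
	fmt.Printf("gcloud logging read '%s' --limit 50\n", filter)
}
```

Whatever is crashing the "sdk" container (a missing binary, a bad dependency, a panic at init) should show up near the top of those entries.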

andrewrjones answered Oct 22 '22 18:10