SDK: Apache Beam SDK for Go 0.5.0
Our Go job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version appears to be unchanged. Last night it failed, and I'm not sure exactly why. It reaches the 1 hour time limit and the job is cancelled due to no worker activity.
Looking at the Stackdriver logs, the only thing that stands out is repeated errors like this: Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff
It seems that it's somehow failing to sync the pod(?) and thus waiting 5 minutes before retrying.
Could anyone shed some light on what might be causing this and how we might go about either finding more information, or diagnosing the cause of the problem?
Note: I checked the status page for Google Cloud Dataflow and there don't appear to be any outages with the service.
We had something similar and found that it was an inability to start the workers (for us it was due to an slf4j issue, but it could be anything that prevents the worker from starting, in whatever language).
If you look at the Stackdriver logs (view Logs in the UI, and click the link to go to Stackdriver), you should be able to view the worker_startup logs.