I have been diving into a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but I am bothered by a few questions.
I went into the source code for the Cloud Endpoints ESP (the Cloud Run integration is in beta) to see if they solve it in a different way, but the same pattern is used there: traces are held in a buffer (1s) that is cleared periodically.
While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.
Is this a hypothetical problem or a real issue?
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
Is this a hypothetical problem or a real issue?
If you consider a Cloud Run service receiving a single request, then it is definitely a problem, as the library will not have time to flush the data before the CPU of the container instance gets throttled.
However, in real life use cases:
Note that trace libraries usually sample the requests they trace; they rarely trace 100% of the requests.
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
No, Stackdriver Trace takes its data from the spans sent to its API. Note that to send data to Stackdriver Trace, you can use libraries like OpenCensus and OpenTelemetry; the proprietary Stackdriver Trace libraries are no longer the recommended way.
You're right. This is a fair concern since most tracing libraries tend to sample/upload trace spans in the background.
Since (1) your CPU is scaled nearly to zero when the container isn't handling any requests and (2) the container instance can be killed at any time due to inactivity, you cannot reliably upload the trace spans collected in your app. As you said, it may sometimes work since we don't fully stop the CPU, but it won't always work.
It appears that some of the Stackdriver (and/or OpenCensus, and its successor OpenTelemetry) libraries let you control the lifecycle of pushing trace spans.
For example, this Go package for the OpenCensus Stackdriver exporter has a Flush() method that you can call before completing your request, rather than relying on the runtime to periodically upload the trace spans: https://godoc.org/contrib.go.opencensus.io/exporter/stackdriver#Exporter.Flush
I assume tracing libraries in other languages expose similar Flush() methods. If not, please let me know in the comments; that would be a valid feature request for those libraries.