We are building rather complex Dataflow jobs that compute models from a streaming source. In particular, we have two models that share a number of metrics and are computed from roughly the same data source. The jobs also perform joins on fairly large datasets.
Do you have any guidelines on how to design this kind of job? Are there any metrics, behaviors, or other factors we should consider in order to make the decision?
Here are the two options we have in mind and how we think they compare:
1. Implement everything in one large job: factor out the common metrics, then compute the model-specific metrics on top of them (a rough sketch follows below).
2. Extract the common-metrics computation into a dedicated job, resulting in three jobs wired together with Pub/Sub.
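To make option #1 concrete, here is a minimal sketch of what the single-job layout could look like, written against the Beam Python SDK. Every name in it (topics, the metric functions, the output sinks) is a placeholder rather than our actual code; the point is only that the common metrics are computed once and the resulting PCollection is branched into the two model-specific transforms.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def compute_common_metrics(event):
    # Placeholder for the metrics both models need.
    return {**event, "common_metric": len(event)}


def compute_model_a(metrics):
    # Placeholder for model-A-specific metrics built on the shared ones.
    return {**metrics, "model": "A"}


def compute_model_b(metrics):
    return {**metrics, "model": "B"}


def run(argv=None):
    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as p:
        shared = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")
            | "Parse" >> beam.Map(json.loads)
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
            # The common metrics are computed exactly once; the resulting
            # PCollection is then branched into the two model pipelines.
            | "CommonMetrics" >> beam.Map(compute_common_metrics))

        (shared
         | "ModelA" >> beam.Map(compute_model_a)
         | "EncodeA" >> beam.Map(lambda row: json.dumps(row).encode("utf-8"))
         | "WriteA" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/model-a-out"))

        (shared
         | "ModelB" >> beam.Map(compute_model_b)
         | "EncodeB" >> beam.Map(lambda row: json.dumps(row).encode("utf-8"))
         | "WriteB" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/model-b-out"))


if __name__ == "__main__":
    run()
```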
You've already mentioned many of the key tradeoffs here -- modularity and smaller failure domains on one side, versus the operational overhead of running several pipelines and the potential complexity of a monolithic job on the other. Another point to be aware of is cost: the extra Pub/Sub traffic between stages will increase the price of the multi-pipeline solution.
Without knowing the specifics of your operation, my advice would be to go with option #2. It sounds like having even a subset of the models up provides at least partial value, so in the event of a critical bug or regression in one job, you'll be able to keep making partial progress while you look for a fix.
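For illustration, here is a rough sketch of how option #2 could be wired, again using the Beam Python SDK with placeholder topic, subscription, and function names (none of these are from your setup): the common-metrics job republishes its output on an intermediate Pub/Sub topic, and each model job consumes that topic independently.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def compute_common_metrics(event):
    # Placeholder for the metrics shared by both models.
    return {**event, "common_metric": len(event)}


def compute_model_a(metrics):
    # Placeholder for the model-A-specific metrics.
    return {**metrics, "model": "A"}


def run_common_metrics_job(argv=None):
    # Job 1: read the raw events, compute the shared metrics, and republish
    # them on an intermediate topic that both model jobs consume.
    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
             topic="projects/my-project/topics/events")
         | "Parse" >> beam.Map(json.loads)
         | "CommonMetrics" >> beam.Map(compute_common_metrics)
         | "Encode" >> beam.Map(lambda row: json.dumps(row).encode("utf-8"))
         | "Publish" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/common-metrics"))


def run_model_a_job(argv=None):
    # Job 2 (job 3 for model B is symmetric): consume the shared metrics
    # through this job's own subscription and add the model-specific ones.
    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadCommon" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/common-metrics-model-a")
         | "Parse" >> beam.Map(json.loads)
         | "ModelA" >> beam.Map(compute_model_a)
         | "Encode" >> beam.Map(lambda row: json.dumps(row).encode("utf-8"))
         | "WriteA" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/model-a-out"))
```

Note that each model job should read from its own subscription on the shared topic so that both receive every message; that intermediate topic is also where the extra Pub/Sub cost mentioned above comes from.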