I'm relatively new to GCP and just starting to set up and evaluate my organization's architecture on GCP.
Scenario:
Data will flow into a Pub/Sub topic (high frequency, low volume of data). The goal is to move that data into Bigtable. From my understanding, you can do that either by having a Cloud Function trigger on the topic or with Dataflow.
Now, I have previous experience with Cloud Functions, which I am satisfied with, so that would be my pick.
But I fail to see the benefit of choosing one over the other. So my question is: when should I choose which of these products?
Thanks
Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem. Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine.
The key benefits of the Cloud Dataflow service include:

- Elimination of operational overhead for data-engineering workloads.
- Low latency for building streaming data pipelines.
- Cost optimization for sudden spikes in workload.
Dataflow templates allow you to easily share your pipelines with team members and across your organization or take advantage of many Google-provided templates to implement simple but useful data processing tasks. This includes Change Data Capture templates for streaming analytics use cases.
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
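For illustration, here is a minimal sketch of what such a streaming pipeline could look like in the Beam Python SDK. The project, topic, instance, and table names are placeholders, and the message format (a JSON payload carrying a `row_key` and a `value`) is an assumption about your data:

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.bigtable.row import DirectRow


def to_bigtable_row(message: bytes) -> DirectRow:
    """Convert a Pub/Sub message payload into a Bigtable row mutation."""
    payload = json.loads(message)  # assumed: JSON with "row_key" and "value"
    row = DirectRow(row_key=payload["row_key"].encode())
    row.set_cell("events", b"value", payload["value"].encode())
    return row


options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",      # or "DirectRunner" for local testing
    project="my-project",         # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        | "ToBigtableRow" >> beam.Map(to_bigtable_row)
        | "WriteToBigtable" >> WriteToBigTable(
            project_id="my-project",
            instance_id="my-instance",
            table_id="my-table")
    )
```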
Both solutions could work. Dataflow will scale better if your Pub/Sub traffic grows to large amounts of data, but Cloud Functions should work fine for low amounts of data; I would look at this page (especially the rate-limits section) to ensure that you fit within the Cloud Functions quotas: https://cloud.google.com/functions/quotas
Another thing to consider is that Dataflow can guarantee exactly-once processing of your data, so that no duplicates end up in Bigtable. Cloud Functions will not do this for you out of the box: Pub/Sub delivery is at-least-once, so the function may be invoked more than once for the same message. If you go with a Functions approach, you will want to make sure that the Pub/Sub message deterministically selects which Bigtable cell is written to; that way, if the function gets retried several times, the same data will simply overwrite the same Bigtable cell.
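As a rough sketch of that idempotent pattern, a Cloud Function along these lines derives both the row key and the cell timestamp from the message itself, so a retried delivery rewrites exactly the same cell instead of creating a duplicate. The `user_id`, `event_id`, and `event_ts` fields are assumptions about your payload, and the project/instance/table IDs are placeholders:

```python
import base64
import datetime
import json

import functions_framework
from google.cloud import bigtable

# Placeholders: substitute your own project, instance, and table IDs.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")


@functions_framework.cloud_event
def pubsub_to_bigtable(cloud_event):
    # Pub/Sub delivers the payload base64-encoded inside the CloudEvent.
    payload = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))

    # Deterministic row key AND cell timestamp, both derived from the
    # message: a redelivery writes the identical cell, so duplicates
    # collapse into a single value.
    row_key = f"{payload['user_id']}#{payload['event_id']}".encode()
    ts = datetime.datetime.fromtimestamp(
        payload["event_ts"], tz=datetime.timezone.utc)

    row = table.direct_row(row_key)
    row.set_cell("events", b"payload", json.dumps(payload).encode(), timestamp=ts)
    row.commit()
```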
Your needs sound relatively straightforward, and Dataflow may be overkill for what you're trying to do. If Cloud Functions do what you need, then maybe stick with that. Often I find that simplicity is key when it comes to maintainability.
However, when you need to perform transformations, like merging these events by user before storing them in Bigtable, that's where Dataflow really shines:
https://beam.apache.org/documentation/programming-guide/#groupbykey
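As a self-contained sketch of what GroupByKey buys you (Beam Python SDK, with a bounded toy input standing in for the parsed Pub/Sub stream), all events sharing a user key get collected together before the downstream write:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Toy bounded input; in the real pipeline this would be the
        # parsed Pub/Sub stream, keyed by user.
        | "CreateEvents" >> beam.Create([
            ("alice", {"page": "/home"}),
            ("alice", {"page": "/cart"}),
            ("bob", {"page": "/home"}),
        ])
        # In a streaming pipeline you would window first, e.g.
        # beam.WindowInto(beam.window.FixedWindows(60)), since GroupByKey
        # over an unbounded source requires windowing or triggers.
        | "GroupByUser" >> beam.GroupByKey()
        | "ToList" >> beam.MapTuple(lambda user, events: (user, list(events)))
        # -> ("alice", [{...}, {...}]), ("bob", [{...}])
        | "Print" >> beam.Map(print)
    )
```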