 

Benefits of Dataflow over Cloud Functions when moving data?

I'm relatively new to GCP and just starting to set up/evaluate my organization's architecture on GCP.

Scenario:
Data will flow into a Pub/Sub topic (high frequency, low amount of data). The goal is to move that data into Bigtable. From my understanding, you can do that either by having a Cloud Function trigger on the topic or with Dataflow.

Now, I have previous experience with Cloud Functions, which I am satisfied with, so that would be my pick.

I fail to see the benefit of choosing one over the other. So my question is: when should I choose which of these products?

Thanks

asked Jul 05 '18 by Tsuni

People also ask

When should I use Cloud Dataproc over Cloud Dataflow?

Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem. Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine.

What are the benefits of Dataflow streaming engine?

The key benefits of Cloud Dataflow service include: Elimination of operational overhead for data engineering workloads. Low latency for building streaming data pipelines. Cost-optimized for sudden spikes in workload.

Why is Dataflow used?

Dataflow templates allow you to easily share your pipelines with team members and across your organization or take advantage of many Google-provided templates to implement simple but useful data processing tasks. This includes Change Data Capture templates for streaming analytics use cases.

What does Cloud Dataflow use to support fast and simplified pipeline development?

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
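
As a minimal sketch of that workflow (the project, region, and bucket names are placeholders), the same Beam program runs locally or on the Dataflow service depending only on the runner option:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DataflowRunner" for "DirectRunner" to execute the same
# pipeline locally; the processing logic itself does not change.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "beam"])
     | "Print" >> beam.Map(print))
```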


2 Answers

Both solutions could work. Dataflow will scale better if your Pub/Sub traffic grows to large amounts of data, but Cloud Functions should work fine for low amounts of data. I would look at this page (especially the rate-limit section) to ensure that you fit within the Cloud Functions quotas: https://cloud.google.com/functions/quotas

Another thing to consider is that Dataflow can guarantee exactly-once processing of your data, so that no duplicates end up in Bigtable. Cloud Functions will not do this for you out of the box. If you go with a functions approach, then you will want to make sure that the Pub/Sub message consistently determines which Bigtable cell is written to; that way, if the function gets retried several times, the same data will simply overwrite the same Bigtable cell.
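
A minimal sketch of that idea, assuming a Python background Cloud Function subscribed to the topic and the google-cloud-bigtable client; the project, instance, table, column family, and payload fields (`user_id`, `event_ts`) are hypothetical placeholders:

```python
import base64
import datetime
import json

from google.cloud import bigtable

# Client and table handles created at module load so they are
# reused across function invocations.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("events")

def pubsub_to_bigtable(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Derive the row key deterministically from the message itself
    # (user id + the event's own timestamp). A retried delivery of
    # the same message then targets the same row instead of creating
    # a duplicate.
    row_key = f"{payload['user_id']}#{payload['event_ts']}".encode("utf-8")

    row = table.direct_row(row_key)
    # Pinning the cell timestamp to the event time (rather than the
    # write time) makes a retry write the identical cell version.
    row.set_cell(
        "cf",
        b"payload",
        json.dumps(payload).encode("utf-8"),
        timestamp=datetime.datetime.utcfromtimestamp(payload["event_ts"]),
    )
    row.commit()
```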

answered Oct 01 '22 by Reuven Lax


Your needs sound relatively straightforward, and Dataflow may be overkill for what you're trying to do. If Cloud Functions do what you need, then maybe stick with that. Often I find that simplicity is key when it comes to maintainability.

However, when you need to perform transformations, like merging these events by user before storing them in Bigtable, that's where Dataflow really shines:

https://beam.apache.org/documentation/programming-guide/#groupbykey
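
For illustration, a rough sketch of such a pipeline with the Apache Beam Python SDK; the topic name, window size, and per-user merge step are hypothetical placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                  topic="projects/my-project/topics/events")
            | "Parse" >> beam.Map(json.loads)
            | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], e))
            # GroupByKey needs bounded groups on an unbounded stream,
            # so window the data first (one-minute windows here).
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "GroupByUser" >> beam.GroupByKey()
            | "MergePerUser" >> beam.MapTuple(
                  lambda user, events: (user, list(events)))
            # A real pipeline would write the merged results to
            # Bigtable from here.
        )

if __name__ == "__main__":
    run()
```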

answered Oct 01 '22 by Alex