
What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Cloud Dataflow to implement an ETL data warehouse solution.

Looking at Google Cloud's offerings, it seems Cloud Dataproc can do the same thing.

Dataproc also seems to be a little cheaper than Dataflow.

Does anybody know the pros and cons of Dataflow compared with Dataproc?

Why does Google offer both?

KosiB asked Sep 26 '17 22:09



2 Answers

Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.

An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions articles.

Quick takeaways:

  • Cloud Dataproc provides a Hadoop cluster on GCP, with access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark). This has strong appeal if you are already familiar with Hadoop tools and have existing Hadoop jobs.
  • Cloud Dataflow provides a place to run Apache Beam based jobs on GCP without having to handle the common chores of running jobs on a cluster, such as balancing work or scaling the number of workers for a job. By default these are managed automatically, for both batch and streaming jobs; on other systems they can be very time-consuming.
    • Apache Beam is an important consideration. Beam jobs are intended to be portable across "runners", which include Cloud Dataflow, so you can focus on your logical computation rather than on how a particular runner works. In comparison, a Spark job is bound to its runner, Spark, and to how that runner works.
    • Cloud Dataflow also lets you create jobs from "templates", which can simplify common tasks where only the parameter values differ.
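
To illustrate the portability point, here is a minimal sketch in Python, assuming the Apache Beam SDK is installed (`pip install apache-beam[gcp]`). The helper names `runner_args` and `run` are hypothetical and exist only for this example; the `--runner`, `--project`, `--region`, and `--temp_location` flags are standard Beam pipeline options. The pipeline body stays the same, and only the flags decide where it executes.

```python
def runner_args(runner, project=None, region=None, temp_location=None):
    """Build Beam command-line flags for the chosen runner (hypothetical helper)."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        # Dataflow needs a GCP project, a region, and a Cloud Storage
        # location for temporary files.
        if not (project and region and temp_location):
            raise ValueError("DataflowRunner needs project, region and temp_location")
        args += [
            f"--project={project}",
            f"--region={region}",
            f"--temp_location={temp_location}",
        ]
    return args


def run(argv):
    """Run a tiny word count; the runner is chosen only by the flags in argv."""
    # Imported here so that runner_args() works even without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "Read" >> beam.Create(["to be or not to be"])
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling `run(runner_args("DirectRunner"))` executes the pipeline locally. Passing `runner_args("DataflowRunner", "my-project", "us-central1", "gs://my-bucket/tmp")` instead (placeholder values, not real resources) would submit the same pipeline code to Cloud Dataflow.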
Andrew Mo answered Oct 13 '22 07:10


Here are three main points to consider when choosing between Dataproc and Dataflow:

  • Provisioning
    Dataproc - Manual provisioning of clusters
    Dataflow - Serverless. Automatic provisioning of clusters

  • Hadoop Dependencies
    Dataproc should be used if the processing depends on any tools in the Hadoop ecosystem.

  • Portability
    Dataflow/Beam provides a clear separation between the processing logic and the underlying execution engine. This helps with portability across execution engines that have a Beam runner: the same pipeline code can run on Dataflow, Spark, or Flink.
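
The three points above can be written down as a small decision helper. This is only a sketch: the function name `suggest_service`, its parameters, and the order in which the checks are applied are assumptions made for illustration, not an official Google rule.

```python
def suggest_service(needs_hadoop_tools, needs_runner_portability, wants_cluster_control):
    """Return (service, reason) based on the three points above.

    The order of the checks is an assumption made for this sketch.
    """
    if needs_hadoop_tools:
        # Hadoop dependencies: Dataproc provides the Hadoop ecosystem.
        return "Dataproc", "job depends on Hadoop-ecosystem tools"
    if needs_runner_portability:
        # Portability: Beam separates pipeline logic from the execution engine.
        return "Dataflow", "Beam pipelines can move between runners"
    if wants_cluster_control:
        # Provisioning: Dataproc clusters are provisioned manually.
        return "Dataproc", "clusters are provisioned and managed by you"
    # Default to the serverless option.
    return "Dataflow", "serverless, resources are provisioned automatically"
```

For example, `suggest_service(True, False, False)` returns `("Dataproc", "job depends on Hadoop-ecosystem tools")`.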

This flowchart from the Google Cloud website shows how to choose one over the other:

Dataproc vs Dataflow: https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg

Further details are available at the link below:
https://cloud.google.com/dataproc/#fast--scalable-data-processing

Kannappan Sirchabesan answered Oct 13 '22 05:10