Google Dataflow vs Apache Spark

I am surveying Google Dataflow and Apache Spark to decide which one is the more suitable solution for our big data analysis business needs.

I found that the Spark platform has Spark SQL and MLlib for structured data queries and machine learning.

I wonder whether there is any corresponding solution in the Google Dataflow platform?

asked Nov 04 '15 by Browny Lin




2 Answers

It would help if you could expand a bit on your specific use case(s). What are you trying to accomplish in relation to "big data analysis"? The short answer... it depends :-)

Here are some key architectural points to consider in relation to Google Cloud Dataflow v. Spark and Hadoop MR.

  • Resource Mgmt: Cloud Dataflow is a completely on-demand execution environment. Specifically, when you execute a job in Dataflow, resources are allocated on demand for that job only; there is no sharing/contention of resources across jobs. With a Spark or MapReduce cluster, by comparison, you would typically deploy a cluster of X nodes, submit jobs, and then tune the node resources across jobs. You can of course build up and tear down these clusters, but the Dataflow model is geared towards hands-free devOps for resource management. If you want to optimize resource usage to job demands, Dataflow is a solid model for controlling cost and nearly forgetting about resource tuning. If you prefer a multi-tenant style cluster, I'd suggest you look at Google Cloud Dataproc, as it provides the on-demand cluster management aspects of Dataflow, but focused on classic Hadoop workloads like MR, Spark, Pig, ...

  • Interactivity: Currently Cloud Dataflow does not provide an interactive mode, meaning that once you submit a job, the worker resources are bound to the graph that was submitted and the majority of the data is loaded into resources as needed. Spark can be a better model if you want to load data into the cluster via in-memory RDDs and then dynamically execute queries (see the Spark caching sketch after this list). The challenge is that as your data sizes and query complexity increase, you will have to handle the devOps. If most of your queries can be expressed in SQL syntax, you may want to look at BigQuery. BigQuery provides the "on demand" aspects of Dataflow and enables you to interactively execute queries over massive amounts of data, e.g. petabytes. The biggest advantage of BigQuery, in my opinion, is that you do not have to think or worry about hardware allocation to deal with your data sizes; as your data grows, you don't have to think about hardware (memory and disk size) configuration.

  • Programming Model: Dataflow's programming model is functionally biased vs. a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives. Things to consider: 1) Dataflow's primary programming language is Java; a Python SDK is in the works. The Dataflow Java SDK is open sourced and has been ported to Scala. Today, Spark has more SDK surface choice with GraphX, Streaming, Spark SQL, and ML. 2) Dataflow is a unified programming model for batch- and streaming-based DAG development. The goal was to remove the complexity and cost of switching between batch and streaming models: the same graph can seamlessly run in either mode. 3) Today, Cloud Dataflow does not support converging/iterative graph execution. If you need the power of something like MLlib, then Spark is the way to go. Keep in mind this is the state of things today.

  • Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow enables you to easily process data in terms of its true event time, rather than solely processing it at its arrival time into the graph. You can window data into fixed, sliding, session, or custom windows based on event time or arrival time. Dataflow also provides Triggers (applied to Windows) that let you control how to handle late-arriving data; net-net, you dial in the level of correctness control to meet the needs of your analysis. For example, say you have a mobile game that interacts with 100 edge nodes producing tens of thousands of game-play events per second, and a group of those nodes can't communicate with your back-end streaming analysis system. With Dataflow, once that data does arrive, you can control how it is handled in relation to your query correctness needs (see the windowing sketch after this list). Dataflow also provides the ability to upgrade your streaming jobs while they are in flight. For example, if you discover a logical bug in a transform, you can upgrade your in-flight job without losing your existing Windowed state. Net-net, you can keep your business running.
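
To make the interactivity point concrete, here is a minimal Spark caching sketch in Java: load once, pin in cluster memory, then run ad-hoc SQL against it. The bucket path and the userId field are hypothetical, and it uses the SparkSession/Dataset API rather than the RDD API that was current when this answer was written:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class InteractiveAnalysis {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("interactive-analysis")
            .getOrCreate();

        // Load the data once and pin it in cluster memory.
        Dataset<Row> events = spark.read().json("gs://my-bucket/events/*.json"); // hypothetical path
        events.cache();
        events.createOrReplaceTempView("events");

        // Ad-hoc queries now run against the cached data instead of
        // re-reading it from storage each time.
        spark.sql("SELECT userId, COUNT(*) AS n FROM events "
                + "GROUP BY userId ORDER BY n DESC LIMIT 10")
            .show();

        spark.stop();
      }
    }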
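And here is a minimal windowing sketch against the Apache Beam Java API (the open-sourced continuation of the Dataflow Java SDK mentioned above), assuming toy in-memory events in place of a real streaming source such as Pub/Sub. It groups events into one-minute windows of event time and uses a trigger to re-fire results for late data:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TimestampedValue;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    public class EventTimeWindowing {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Toy input: each element carries its original *event* timestamp.
        // In a real job this would come from a streaming source.
        PCollection<String> events = p.apply(Create.timestamped(
            TimestampedValue.of("player-1", new Instant(0L)),
            TimestampedValue.of("player-2", new Instant(30_000L)),
            TimestampedValue.of("player-1", new Instant(90_000L))));

        PCollection<KV<String, Long>> perPlayerCounts = events
            // Group into 1-minute windows of event time, not arrival time.
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Fire a result at the watermark, then re-fire for each
                // element that arrives up to 10 minutes late.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes())
            // Count events per player within each window.
            .apply(Count.perElement());

        p.run().waitUntilFinish();
      }
    }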

Net-net:

- If you are primarily doing ETL-style work (filtering, shaping, joining, ...) or batch-style MapReduce, Dataflow is a great path if you want minimal devOps.
- If you need to implement ML-style graphs, go the Spark path and give Dataproc a try.
- If you are doing ML and first need ETL to clean up your training data, implement a hybrid with Dataflow and Dataproc.
- If you need interactivity, Spark is a solid choice, but so is BigQuery if you can express your queries in SQL.
- If you need to process your ETL and/or MR jobs over streams, Dataflow is a solid choice.


So... what are your scenarios?

answered Sep 24 '22 by Eric Schmidt


I've tried both:

Dataflow is still very young; there is no "out-of-the-box" solution for doing ML with it (even though you could implement algorithms in transforms). You could output the processed data to Cloud Storage and read it back later with another tool.
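
For instance, a minimal sketch of that pattern with the Apache Beam Java API (the open-sourced evolution of the Dataflow SDK): clean the raw data in a pipeline and write the result to Cloud Storage for another tool to pick up. The bucket paths and the cleanup transform are hypothetical placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class EtlToStorage {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("gs://my-bucket/raw/*.csv"))          // hypothetical input
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((String line) -> line.trim().toLowerCase()))           // placeholder cleanup step
         .apply(TextIO.write().to("gs://my-bucket/clean/part"));         // read later by another tool

        p.run().waitUntilFinish();
      }
    }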

Spark would be recommended, but you would have to manage your cluster yourself. However, there is a good alternative: Google Dataproc.

You can develop analysis tools with Spark and deploy them with one command on your cluster; Dataproc manages the cluster itself without you having to tweak the configuration. A sketch of such a job is below.
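
As an illustration, here is a minimal sketch of the kind of job you might package as a jar and submit to a Dataproc cluster: a Spark MLlib training step reading features that an earlier ETL job wrote to Cloud Storage. The paths and parameters are hypothetical:

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TrainOnDataproc {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("kmeans-training")
            .getOrCreate();

        // Feature vectors prepared earlier (e.g. by a Dataflow ETL job).
        Dataset<Row> features = spark.read().format("libsvm")
            .load("gs://my-bucket/clean/features.libsvm");   // hypothetical path

        // Train with MLlib and persist the model back to Cloud Storage.
        KMeansModel model = new KMeans().setK(10).setSeed(42L).fit(features);
        model.write().overwrite().save("gs://my-bucket/models/kmeans");

        spark.stop();
      }
    }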

answered Sep 25 '22 by Paul K.