Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax. I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far: <ul> <li>Pro: Abstraction over different execution backends.</li> <li>Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.</li> </ul> Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance? Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).

There's a few things that Beam adds over many of the existing engines. <ul> <li>Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.</li> <li>APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecification the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.</li> <li>Portability across runtimes.: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud. </li> </ul> Designing the Beam model to be a useful abstraction over many, different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. <ul> <li> Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.</li> <li>And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model. </li> <li>This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time. So that's why we're using capability matrix to try to clearly communicate the state of things. </li> </ul>

What are the benefits of Apache Beam over Spark/Flink for batch processing?

1 Answers

There's a few things that Beam adds over many of the existing engines.

Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecification the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.
Portability across runtimes.: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.

Designing the Beam model to be a useful abstraction over many, different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.

Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.
This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time. So that's why we're using capability matrix to try to clearly communicate the state of things.

114

answered Oct 22 '22 10:10

Frances

Related questions
                            
                                How does HashPartitioner work?
                            
                                How to link PyCharm with PySpark?
                            
                                How to pass -D parameter or environment variable to Spark job?
                            
                                Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
                            
                                How to write unit tests in Spark 2.0+?
                            
                                Updating a dataframe column in spark
                            
                                Spark SQL: apply aggregate functions to a list of columns
                            
                                Get current number of partitions of a DataFrame
                            
                                How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4
                            
                                Overwrite specific partitions in spark dataframe write method
                            
                                Concatenate two PySpark dataframes
                            
                                Split Spark Dataframe string column into multiple columns
                            
                                How to export a table dataframe in PySpark to csv?
                            
                                Mac spark-shell Error initializing SparkContext
                            
                                How to save DataFrame directly to Hive?
                            
                                How to set up Spark on Windows?
                            
                                At what situation I can use Dask instead of Apache Spark? [closed]
                            
                                What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
                            
                                Is there a way to take the first 1000 rows of a Spark Dataframe?
                            
                                How do I set the driver's python version in spark?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Tags:

apache-spark

apache-flink

apache-beam

bluenote10

People also ask

1 Answers

Frances

Recent Activity

Donate For Us