How can I modify the apache beam DirectRunner to make it faster?

Question

We are building a framework using apache beam, and there are some use cases (like late data) that we simply don't need. What we do need however, is blazing-fast performance.

I'm trying to modify the DirectRunner to be faster. So far I've commented out Metrics and Enforcements, which yielded a small performance increase. We've also removed SynchronizedProcessingTimeOutputWatermark and SynchronizedProcessingTimeInputWatermark without affecting our use case, which suggests that there may be more (Watermarks? Holds?) which could be removed.

I realise this is a rather vague question, but what else can I remove to make it faster?

Ben Chambers · Accepted Answer

Have you tried running it with a profiler on your use cases? These would identify the hot spots that are affecting your performance.

You should know that the DirectRunner was written to demonstrate the model and validate the pipeline code. It wasn't written with performance or scalability as goals. If you're looking to run Beam pipelines efficiently, you may want to look at one of the distributed runners (such as Spark, Flink, Dataflow, etc.) since they will all perform better on large data.

robertwb · Answer

You could also consider using a "production" runner such as Flink in local mode.

How can I modify the apache beam DirectRunner to make it faster?

Tags:

apache-beam

mdarwin

2 Answers

Ben Chambers

robertwb

Recent Activity

Donate For Us

How can I modify the apache beam DirectRunner to make it faster?

Tags:

apache-beam

mdarwin

2 Answers

Ben Chambers

robertwb

Related questions

Recent Activity

Donate For Us