We are building a framework on Apache Beam, and there are some features (such as late-data handling) that we simply don't need. What we do need, however, is blazing-fast performance.
I'm trying to modify the DirectRunner to be faster. So far I've commented out Metrics and Enforcements, which yielded a small performance increase. We've also removed SynchronizedProcessingTimeOutputWatermark and SynchronizedProcessingTimeInputWatermark without affecting our use case, which suggests there may be more (watermarks? holds?) that could be removed.
I realise this is a rather vague question, but what else can I remove to make it faster?
Have you tried running your use cases under a profiler? That would identify the hot spots that are limiting your performance.
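As a rough illustration of the profiling suggestion, here is a minimal sketch that wraps a workload in a Java Flight Recorder recording using the JDK's own `jdk.jfr` API (JDK 11 or later). The class `JfrProfileSketch`, the method `profile`, and the stand-in loop are hypothetical names made up for this example and are not part of Beam; in a real run the workload would be whatever call executes your pipeline.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

// Hypothetical helper, not part of Beam: records a JFR profile around any workload.
class JfrProfileSketch {

    // Runs the workload under a Java Flight Recorder recording and writes it to `output`.
    static Path profile(Runnable workload, Path output) throws IOException {
        try (Recording recording = new Recording()) {
            // Sample Java stack traces every 10 ms to locate CPU hot spots.
            recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
            recording.start();
            try {
                workload.run();
            } finally {
                recording.stop();
            }
            // Write the collected events to disk before the recording is closed.
            recording.dump(output);
        }
        return output;
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempFile("pipeline-profile", ".jfr");
        // Stand-in workload (assumption); replace with the code that runs your pipeline.
        profile(() -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i % 7;
            }
            System.out.println("checksum: " + sum);
        }, output);
        System.out.println("Wrote " + output + " (" + Files.size(output) + " bytes)");
    }
}
```

The resulting `.jfr` file can be opened in JDK Mission Control to see which methods dominate the samples, which should tell you whether watermark bookkeeping or something else entirely is the real cost.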
You should know that the DirectRunner was written to demonstrate the model and validate pipeline code. It wasn't written with performance or scalability as goals. If you're looking to run Beam pipelines efficiently, you may want to look at one of the distributed runners (such as Spark, Flink, or Dataflow), since they will all perform better on large data.
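If you do stay on the DirectRunner, its validation checks can be switched off through `DirectOptions` rather than by commenting out source. The sketch below assumes the `beam-runners-direct-java` artifact is on the classpath; the class name `DirectRunnerOptionsSketch` and the tiny `Create.of` input are placeholders, and whether these settings help noticeably for your workload is something only a profile can confirm.

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

// Hypothetical example class: configures the stock DirectRunner without patching it.
class DirectRunnerOptionsSketch {
    public static void main(String[] args) {
        DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
        options.setRunner(DirectRunner.class);

        // Skip the check that elements are not mutated after being output.
        options.setEnforceImmutability(false);
        // Skip the check that every element survives an encode/decode round trip.
        options.setEnforceEncodability(false);
        // Assumption: one worker thread per core is a reasonable starting point.
        options.setTargetParallelism(Runtime.getRuntime().availableProcessors());

        Pipeline pipeline = Pipeline.create(options);
        // Placeholder input; your own transforms would go here.
        pipeline.apply(Create.of(1, 2, 3));
        pipeline.run().waitUntilFinish();
    }
}
```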
You could also consider using a "production" runner such as Flink in local mode.
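A minimal sketch of what running on Flink in local mode might look like, assuming a `beam-runners-flink` artifact matching your Flink version is on the classpath. The class name `FlinkLocalSketch`, the parallelism of 4, and the placeholder input are illustrative choices, not recommendations.

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

// Hypothetical example class: runs a pipeline on an embedded, in-process Flink.
class FlinkLocalSketch {
    public static void main(String[] args) {
        FlinkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class);

        // "[local]" starts Flink inside this JVM instead of connecting to a cluster.
        options.setFlinkMaster("[local]");
        // Assumed value for illustration; tune to the number of available cores.
        options.setParallelism(4);

        Pipeline pipeline = Pipeline.create(options);
        // Placeholder input; your own transforms would go here.
        pipeline.apply(Create.of("a", "b", "c"));
        pipeline.run().waitUntilFinish();
    }
}
```

Because the pipeline code itself does not change, you can run the same job on both runners and compare timings before deciding whether patching the DirectRunner is worth the effort.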