Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark Structured Streaming - Limitations? (Source Performance, Unsupported Operations, Spark UI)

I have started to explore Spark Structured Streaming to write some applications having been using DStreams before this.

I am trying to understand the limitations of Structured Streaming as I have started to use it but would like to know the draw backs if any.

Q1. For each sink in the structured streaming app, it will independently read from a source (eg. Kafka). Meaning if you read from one topic A, and write to 3 places (e.g. ES, Kafka, S3) it will actually set up 3 source connections independent of each other.

Will this be a performance degradation? As it will require 3 independent connections managed instead of one (DStream approach)

Q2. I know that joining 2 streaming data sets is unsupported. How can I perform calculations on 2 streams?

If I have data from topic A and other data from topic B, is it possible to do calculations on both of these somehow?

Q3. In Spark Streaming UI, there is a Streaming tab for metrics and to view the throughput of the application. In Structured Streaming this is not available anymore.

Why is this? Is the intention to obtain all metrics programmatically and push to a separate monitoring service?

like image 268
bp2010 Avatar asked May 23 '18 08:05

bp2010


1 Answers

For each sink in the structured streaming app, it will independently read from a source (eg. Kafka). Meaning if you read from one topic A, and write to 3 places (e.g. ES, Kafka, S3) it will actually set up 3 source connections independent of each other.

Out of the box, yes. Each Sink describes a different execution flow. But, you can get around this by not using built-in sinks and creating your own custom one, which controls how you do the writes.

Will this be a performance degradation? As it will require 3 independent connections managed instead of one (DStream approach)

Probably. You usually don't want to read and process the same thing over and over again only because you have more than one Sink for the same source. Again, you can accommodate this by building your own Sink (which shouldn't be too much work)

Q2. I know that joining 2 streaming data sets is unsupported. How can I perform calculations on 2 streams?

As of Spark 2.3, this is supported OOTB.

Q3. In Spark Streaming UI, there is a Streaming tab for metrics and to view the throughput of the application. In Structured Streaming this is not available anymore. Why is this? Is the intention to obtain all metrics programmatically and push to a separate monitoring service?

You're right. The fancy UI you had in Structured Streaming doesn't exist (yet) in Spark. I've asked Michael Armburst this question and his answer was "priorities", they simply haven't had the time to put in work to create something as fancy as Spark Streaming had because they wanted to squeeze more features in. The good thing about OSS is you can always contribute the missing part yourself if you need it.

All in all, Structured Streaming is the future for Spark. No more work is being invested in DStreams. For high throughput systems, I can say there is a significant benefit in joining the Structured Streaming bandwagon. I've switched over once 2.1 was released and it is definitely a performance boost, especially in the areas of stateful streaming pipelines.

like image 83
Yuval Itzchakov Avatar answered Sep 21 '22 15:09

Yuval Itzchakov