Avoiding multiple streaming queries

I have a structured streaming query which sinks to Kafka. This query has a complex aggregation logic.

I would like to sink the output DataFrame of this query to multiple Kafka topics, each partitioned on a different 'key' column. I don't want to use a separate Kafka sink for each topic, because that would mean running multiple streaming queries, one per topic, each repeating the complex aggregation.

Questions:

  1. Is there a way to output the results of a structured streaming query to multiple Kafka topics each with a different key column but without having to execute multiple streaming queries?

  2. If not, would it be efficient to cascade multiple queries, such that the first query does the complex aggregation and writes its output to Kafka, and the subsequent queries simply read the output of the first query and write to their respective topics, thus avoiding the complex aggregation again?

Thanks in advance for any help.

asked Feb 13 '18 by Priyank Shrivastava

1 Answer

So the answer was staring me in the face all along. It's documented as well; see the link below.

You can write to multiple Kafka topics from a single query. If the DataFrame you want to write has a column named "topic" (along with the "key" and "value" columns), each row is written to the topic named in that row's "topic" column. This works automatically, so the only thing you need to figure out is how to generate the values of that column.

This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
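For illustration, here is a minimal Scala sketch of the idea. The rate source standing in for the complex aggregation, the topic names (topic_us, topic_eu), the per-topic key columns (userId, region), the broker address, and the checkpoint path are all hypothetical placeholders; it also assumes the spark-sql-kafka-0-10 package is on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("multi-topic-sink").getOrCreate()
    import spark.implicits._

    // Stand-in for the complex aggregation: a rate stream with hypothetical
    // "userId", "region", and "payload" columns. Replace with your real query.
    val aggregated = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()
      .withColumn("userId", concat(lit("user-"), ($"value" % 100).cast("string")))
      .withColumn("region", when($"value" % 2 === 0, "us").otherwise("eu"))
      .withColumn("payload", $"value".cast("string"))

    // Add a per-row "topic" column; the Kafka sink routes each row to the
    // topic named there. The per-row "key" column can be taken from a
    // different source column depending on the target topic.
    val routed = aggregated
      .withColumn("topic",
        when($"region" === "us", lit("topic_us")).otherwise(lit("topic_eu")))
      .withColumn("key",
        when($"topic" === "topic_us", $"userId").otherwise($"region"))
      .select($"topic", $"key", $"payload".as("value"))

    // A single streaming query writes to all topics. Do not set the "topic"
    // option here, since that option would override the per-row column.
    val query = routed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
      .option("checkpointLocation", "/tmp/chk/multi-topic") // hypothetical path
      .start()

    query.awaitTermination()

Because Kafka partitions messages by key, choosing the "key" column per row means each topic can effectively be partitioned on a different column, which is exactly the multi-key requirement from the question.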

answered Sep 22 '22 by Priyank Shrivastava