
Apache Flink: Using filter() or split() to split a stream?

I have a DataStream from Kafka in which one field of MyModel (a) has two possible values, a1 and a2. MyModel is a POJO with domain-specific fields parsed from a Kafka message.

DataStream<MyModel> stream = env.addSource(myKafkaConsumer);

I want to apply windows and operators to the records for a1 and a2 separately. What is a good way to separate them? I have two options in mind, filter and split/select, but I don't know which one is faster.

Filter approach

stream
        .filter(<MyModel.a == a1>)
        .keyBy()
        .window()
        .apply()
        .addSink()

stream
        .filter(<MyModel.a == a2>)
        .keyBy()
        .window()
        .apply()
        .addSink()

Split and select approach

SplitStream<MyModel> split = stream.split(…)
    split
        .select(<MyModel.a == a1>)
        …
        .addSink()

    split
        .select(<MyModel.a == a2>)
        …
        .addSink()

If split and select are better, how do I implement them when I want to split based on the value of a field in MyModel?

Son asked Dec 03 '18 06:12


2 Answers

Both methods behave pretty much the same. Internally, the split() operator forks the stream and applies filters as well.

There is a third option, side outputs. Side outputs have some benefits, such as allowing each output to have a different data type. Moreover, with side outputs the filter condition is evaluated only once per record.
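To make the "evaluated only once" point concrete, here is a minimal plain-Java sketch. It is not Flink code: it only models the two routing strategies on an in-memory list and counts how often the condition is checked. The class RoutingModel, its Result holder, and the cut-down MyModel with a single String field a are hypothetical names made up for this illustration. In an actual Flink job the "route once" branch would typically be a ProcessFunction that calls ctx.output(outputTag, value) for one of the keys, with the side stream read back via getSideOutput(outputTag).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoutingModel {

    /** Hypothetical stand-in for the asker's MyModel; only field `a` matters here. */
    static final class MyModel {
        final String a;
        MyModel(String a) { this.a = a; }
    }

    /** Holds the two resulting branches plus the number of condition checks performed. */
    static final class Result {
        final List<MyModel> a1 = new ArrayList<>();
        final List<MyModel> a2 = new ArrayList<>();
        int evaluations = 0;
    }

    /** Filter approach: every record is tested once per branch. */
    static Result twoFilters(List<MyModel> input) {
        Result r = new Result();
        for (MyModel m : input) {          // models stream.filter(a == a1)
            r.evaluations++;
            if ("a1".equals(m.a)) r.a1.add(m);
        }
        for (MyModel m : input) {          // models stream.filter(a == a2)
            r.evaluations++;
            if ("a2".equals(m.a)) r.a2.add(m);
        }
        return r;
    }

    /**
     * Side-output style: every record is tested once and routed to exactly one branch.
     * Assumes, as in the question, that `a` only ever takes the values a1 or a2.
     */
    static Result routeOnce(List<MyModel> input) {
        Result r = new Result();
        for (MyModel m : input) {
            r.evaluations++;
            if ("a1".equals(m.a)) {
                r.a1.add(m);               // "main" output
            } else {
                r.a2.add(m);               // "side" output
            }
        }
        return r;
    }

    public static void main(String[] args) {
        List<MyModel> input = Arrays.asList(
                new MyModel("a1"), new MyModel("a2"),
                new MyModel("a1"), new MyModel("a2"));

        Result filtered = twoFilters(input);
        Result routed = routeOnce(input);

        System.out.println("two filters: " + filtered.evaluations + " checks"); // 8
        System.out.println("route once:  " + routed.evaluations + " checks");   // 4
    }
}
```

Both strategies produce the same two branches; the difference is only the number of checks per record, which is unlikely to matter unless the condition itself is expensive.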

Fabian Hueske answered Sep 21 '22 15:09


SplitStream and the split() method in DataStream have been deprecated since Flink 1.6 (see the Flink Deprecated List) and are no longer recommended.

Monika X answered Sep 21 '22 15:09