I am a newbie with apache flink. I have an unbound data stream in my input (fed into flink 0.10 via kakfa).
I want to get the 1st occurence of each primary key (the primary key is the contract_num and the event_dt).
These "duplicates" occur nearly immediately after each other.
The source system cannot filter this for me, so flink has to do it.
Here is my input data:
contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Here is the output data I want:
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Note the 2nd row has been removed as the key combination of A001 and '2016-02-24 10:25:08' already occurred in the 1st row.
How can I do this with flink 0.10?
I was thinking about using keyBy(0,1)
but after that I don't know what to do!
(I used joda-time and org.flinkspector to setup these tests).
@Test
public void test() {
DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
DateTime oneSecondsAgo = (new DateTime()).minusSeconds(2);
DataStream<Tuple3<String, Date, String>> testStream =
createTimedTestStreamWith(
Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
.emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
.emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
.emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
.close();
testStream.keyBy(0,1);
}
It allows you to detect event patterns in an endless stream of events, giving you the opportunity to get hold of what’s important in your data. This page describes the API calls available in Flink CEP.
FlinkCEP is the Complex Event Processing (CEP) library implemented on top of Flink. It allows you to detect event patterns in an endless stream of events, giving you the opportunity to get hold of what’s important in your data. This page describes the API calls available in Flink CEP.
FlinkCEP - Complex event processing for Flink # FlinkCEP is the Complex Event Processing (CEP) library implemented on top of Flink. It allows you to detect event patterns in an endless stream of events, giving you the opportunity to get hold of what’s important in your data.
FlinkCEP supports the following forms of contiguity between events: Strict Contiguity: Expects all matching events to appear strictly one after the other, without any non-matching events in-between. Relaxed Contiguity: Ignores non-matching events appearing in-between the matching ones.
Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space. The reason is that you have to store the already seen keys somewhere to filter out the duplicates. Thus, it would be good to define a time window after which you can purge the current set of seen keys.
If you're aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap
operation after the keyBy
call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink's fault tolerance mechanism because your state will be automatically checkpointed.
A Flink program doing your job could look like
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));
input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();
env.execute("Test");
}
where the implementation of DuplicateFilter
depends on the version of Flink.
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
private ValueState<Boolean> operatorState;
@Override
public void open(Configuration configuration) {
operatorState = this.getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
if (!operatorState.value()) {
// we haven't seen the element yet
out.collect(value);
// set operator state to true so that we don't emit elements with this key again
operatorState.update(true);
}
}
}
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
private OperatorState<Boolean> operatorState;
@Override
public void open(Configuration configuration) {
operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
}
@Override
public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
if (!operatorState.value()) {
// we haven't seen the element yet
out.collect(value);
operatorState.update(true);
}
}
}
input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String,Date,String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
@Override
public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
out.collect(input.iterator().next());
}
})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With