Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Explain Kinesis Shard Iterator - AWS Java SDK

OK, I'll start with an elaborated use-case and will explain my question:

  1. I use a 3rd party web analytics platform which utilizes AWS Kinesis streams in order to pass data from the client into the final destination - a Kinesis stream;
  2. The web analytics platform uses 2 streams:
    1. A data collector stream (single shard stream);
    2. A second stream to enrich the raw data from the collector stream (single shard stream); Most importantly, this stream consumes the raw data from the first stream using TRIM_HORIZON iterator type;
  3. I consume the data from the stream using AWS Java SDK, secifically using the GetShardIteratorRequest class;
  4. I'm currently developing the extraction class, so this is done synchronously, meaning I consume data only when I compile my class;
  5. The class surprisingly works, although there are some things that I fail to understand, specifically with respect to how the data is consumed from the stream and the meaning of each one of iterator types;

My problem is that the data I retrieve is inconsistent and has no chronological logic in it.

  • When I use AT_SEQUENCE_NUMBER and provide the first sequence number from the shard with

    .getSequenceNumberRange().getStartingSequenceNumber();

    ... as the ``, I'm not getting all records. Similarly, AFTER_SEQUENCE_NUMBER;

  • When I use LATEST, I'm getting zero results;
  • When I use TRIM_HORIZON, which should make sense to use, it doesn't seem to be working fine. It used to provide me the data, and then I've added new "events" (records to the final stream) and I received zero records. Mystery.

My questions are:

  1. How can I safely consume data from the stream, without having to worry about missed records?
  2. Is there an alternative to the ShardIteratorRequest?
  3. If there is, how can I just "browse" the stream and see what's inside it for debugging references?
  4. What am I missing with the TRIM_HORIZON method?

Thanks in advance, I'd really love to learn a bit more about data consumption from a Kinesis stream.

like image 903
Yuval Herziger Avatar asked Sep 17 '14 12:09

Yuval Herziger


1 Answers

I understand the confusion above, and I had the same issues, but I think I've figured it out now. Note that I am using the JSON API directly without KCL.

I seems that the API gives clients 2 basic choices of iterators when they begin consuming a stream :

A) TRIM_HORIZON: for reading PAST records delayed between many minutes (even hours) and 24 hours old. It doesn't return recently put records. Using AFTER_SEQUENCE_NUMBER on the last record seen by this iterator returns an empty array even when records have been recently PUT.

B) LATEST: for reading FUTURE records in real time (immediately after they are PUT). I was tricked by the only sentence of documentation I could find on this "Start reading just after the most recent record in the shard, so that you always read the most recent data in the shard." You were getting an empty array because no records had been PUT since getting the iterator. If you get this type of iterator, and then PUT a record, that record will be immediately available.

Lastly, if you know the sequence id of a recently put record, you can get it immediately using AT_SEQUENCE_NUMBER, and you can get later records using AFTER_SEQUENCE_NUMBER even though they wont appear to a TRIM_HORIZON iterator.

The above does mean that if you want to read all known past records and future records in real time, you have to use a combination of A and B, with logic to cope with the records in between (the recent past). The KCL may well smooth over this.

like image 109
Buzzware Avatar answered Oct 18 '22 11:10

Buzzware