I can't seem to find a decent example that shows how I can consume an AWS Kinesis stream via Python. Can someone please point me to some examples I could look into?


Consuming a kinesis stream in python

2 Answers

You should use boto.kinesis:

import logging
import time

from boto import kinesis

logger = logging.getLogger(__name__)

After you have created a stream:

Step 1: connect to AWS Kinesis:

auth = {"aws_access_key_id": "id", "aws_secret_access_key": "key"}
connection = kinesis.connect_to_region('us-east-1', **auth)

Step 2: get the stream info (how many shards it has, whether it is active, ...):

tries = 0
response = None
while tries < 10:
    tries += 1
    time.sleep(1)
    try:
        response = connection.describe_stream('stream_name')
        if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
            break
    except Exception as e:
        # Log the actual error instead of swallowing it silently
        logger.error('error while trying to describe kinesis stream: %s', e)
else:
    # The loop ran out of tries without ever seeing an ACTIVE stream
    raise TimeoutError('Stream is still not active, aborting...')

Step 3: get all shard ids, and for each shard id get the shard iterator:

# Where to start reading: 'TRIM_HORIZON' (oldest available record) or 'LATEST' (new records only).
# 'TRIM_HORIZON' is an assumption here; pick whichever fits your use case.
shard_iterator_type = 'TRIM_HORIZON'
shard_ids = []
stream_name = None
if response and 'StreamDescription' in response:
    stream_name = response['StreamDescription']['StreamName']
    for shard in response['StreamDescription']['Shards']:
        shard_id = shard['ShardId']
        shard_iterator = connection.get_shard_iterator(stream_name, shard_id, shard_iterator_type)
        shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
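The shard_iterator_type passed to get_shard_iterator decides where reading starts. As a minimal sketch, assuming you store the last processed sequence number yourself, the type could be chosen as below. The helper name open_iterator and its last_sequence_number argument are hypothetical (not part of boto); the sketch relies on boto's get_shard_iterator accepting a starting_sequence_number argument.

```python
def open_iterator(connection, stream_name, shard_id, last_sequence_number=None):
    """Return a shard iterator string for one shard.

    Hypothetical helper: `last_sequence_number` is whatever checkpoint
    you saved yourself; None means "no checkpoint yet".
    """
    if last_sequence_number is None:
        # No checkpoint: start from the oldest record still in the shard.
        resp = connection.get_shard_iterator(stream_name, shard_id, 'TRIM_HORIZON')
    else:
        # Resume right after the last record that was already processed.
        resp = connection.get_shard_iterator(
            stream_name, shard_id, 'AFTER_SEQUENCE_NUMBER',
            starting_sequence_number=last_sequence_number)
    return resp['ShardIterator']
```

Use 'LATEST' instead of 'TRIM_HORIZON' if you only care about records that arrive after the consumer starts.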

Step 4: read the data for each shard

limit is the maximum number of records that you want to receive per call (a single call can return up to 10 MB of data). shard_iterator is the shard iterator from the previous step.

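def read_records(connection, shard_iterator, limit=None):
    # Poll the shard until a call returns records, or give up after 100 tries
    tries = 0
    result = []
    while tries < 100:
        tries += 1
        response = connection.get_records(shard_iterator=shard_iterator, limit=limit)
        shard_iterator = response['NextShardIterator']
        if len(response['Records']) > 0:
            for res in response['Records']:
                result.append(res['Data'])
            break
    # Return the iterator too, so the next call can resume where this one stopped
    return result, shard_iterator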

In your next call to get_records, use the shard_iterator that you received with the result of the previous get_records call.
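Chaining the iterators can be sketched as a small polling loop. This is only an illustration under assumptions: consume_shard, handle, max_polls, and idle_sleep are hypothetical names, and connection is expected to behave like the boto connection used in the steps here. A missing or null NextShardIterator is treated as "the shard is closed".

```python
import time


def consume_shard(connection, shard_iterator, handle,
                  limit=100, max_polls=10, idle_sleep=1.0):
    """Poll one shard and pass each record's data to `handle`.

    Returns the iterator to resume from, or None once the shard is closed.
    """
    for _ in range(max_polls):
        if shard_iterator is None:
            # Closed shard (after a split or merge): nothing left to read.
            break
        response = connection.get_records(shard_iterator, limit=limit)
        for record in response['Records']:
            handle(record['Data'])
        # Always continue from the iterator returned by the previous call.
        shard_iterator = response.get('NextShardIterator')
        if not response['Records']:
            # An empty batch is normal; back off before polling again.
            time.sleep(idle_sleep)
    return shard_iterator
```

A call such as consume_shard(connection, shard_iterator, print) would print the data of each record as it arrives. The pause on empty batches also helps stay under the per-shard read rate limit.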

Note: a single call to get_records (even with limit = None) can return an empty list of records, so keep polling with the next shard iterator. Each call reads from one shard only, and records that share a partition key always land in the same shard. You choose the partition key when you put data into the stream:

connection.put_record(stream_name, data, partition_key)
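To see why records with the same partition key end up in the same shard: Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below mimics this lookup locally; the function names are hypothetical, and shards is assumed to be the Shards list from a describe_stream response, where each entry carries a HashKeyRange with decimal-string bounds.

```python
import hashlib


def hash_partition_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode('utf-8')).hexdigest()
    return int(digest, 16)


def shard_for_key(partition_key, shards):
    """Return the ShardId whose hash key range contains the key's hash."""
    hashed = hash_partition_key(partition_key)
    for shard in shards:
        key_range = shard['HashKeyRange']
        start = int(key_range['StartingHashKey'])
        end = int(key_range['EndingHashKey'])
        if start <= hashed <= end:
            return shard['ShardId']
    return None  # no open shard covers this hash
```

This is only for reasoning about data distribution; the service does the routing itself when you call put_record.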

answered Oct 22 '22 15:10

Eyal Ch

While this question has already been answered, it might be a good idea for future readers to consider using the Kinesis Client Library (KCL) for Python instead of using boto directly. It simplifies consuming from the stream when you have multiple consumer instances, and/or changing shard configurations.

https://aws.amazon.com/blogs/aws/speak-to-kinesis-in-python/

A more complete enumeration of what the KCL provides:

Connects to the stream
Enumerates the shards
Coordinates shard associations with other workers (if any)
Instantiates a record processor for every shard it manages
Pulls data records from the stream
Pushes the records to the corresponding record processor
Checkpoints processed records (it uses DynamoDB so your code doesn't have to manually persist the checkpoint value)
Balances shard-worker associations when the worker instance count changes
Balances shard-worker associations when shards are split or merged

The items in bold are the ones where I think the KCL really provides non-trivial value over boto. But depending on your use case, boto may be much, much simpler.
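To give a feel for what "balancing shard-worker associations" involves if you do it by hand with boto, here is a toy round-robin assignment. It is not the KCL's actual algorithm (the KCL coordinates workers through leases stored in DynamoDB); assign_shards and its arguments are hypothetical names used only for illustration.

```python
def assign_shards(shard_ids, worker_ids):
    """Spread shards across workers as evenly as possible (round-robin)."""
    assignments = {worker: [] for worker in worker_ids}
    if not worker_ids:
        return assignments
    # Sort so every worker computes the same assignment from the same inputs.
    for index, shard_id in enumerate(sorted(shard_ids)):
        worker = worker_ids[index % len(worker_ids)]
        assignments[worker].append(shard_id)
    return assignments
```

Every time a worker joins or leaves, or a shard is split or merged, the assignment has to be recomputed and the affected workers have to hand over their checkpoints. That bookkeeping is the part the KCL takes off your hands.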

answered Oct 22 '22 15:10

jumand


Tags:

python

stream

amazon-web-services

boto

aliirz
