
How to deal with concatenated Avro files?

I'm storing data generated by my web application in Apache Avro format. The data is encoded and sent to an Amazon Kinesis Firehose delivery stream, which buffers it and writes it to Amazon S3 every 300 seconds or so. Since I have multiple web servers, multiple blobs of Avro files are sent to Firehose, which concatenates them and periodically writes the result to S3.

When I grab the file from S3, I can't use the normal Avro tools to decode it, since it's actually multiple files in one. I could add a delimiter, I suppose, but that seems risky in case the data being logged also contains the same delimiter.

What's the best way to deal with this? I couldn't find anything in the standard that supports multiple Avro files concatenated into the same file.
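For objects that have already been written this way, the boundaries can in principle be recovered without a delimiter. The sketch below assumes that every blob sent to Firehose is a complete Avro object container file: such a file starts with the 4-byte magic `Obj\x01`, and every data block ends with the 16-byte sync marker declared in the file's header, so the parts can be found by walking the blocks. The name `split_concatenated_avro` and its helpers are hypothetical and not part of any Avro library.

```python
import io

MAGIC = b"Obj\x01"  # opens every Avro object container file
SYNC_SIZE = 16      # length of the per-file sync marker


def _read_long(buf):
    """Decode one zigzag, variable-length Avro long from a binary stream."""
    shift = 0
    acc = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        acc |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def _skip_header(buf):
    """Skip the magic and metadata map, then return the file's sync marker."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    while True:
        count = _read_long(buf)
        if count == 0:
            break
        if count < 0:
            _read_long(buf)  # a negative count is followed by the block's byte size
            count = -count
        for _ in range(count):
            buf.seek(_read_long(buf), io.SEEK_CUR)  # skip the key
            buf.seek(_read_long(buf), io.SEEK_CUR)  # skip the value
    sync = buf.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("truncated Avro header")
    return sync


def split_concatenated_avro(blob):
    """Split bytes holding back-to-back Avro container files into one bytes object per file."""
    buf = io.BytesIO(blob)
    parts = []
    while buf.tell() < len(blob):
        start = buf.tell()
        sync = _skip_header(buf)
        # A data block starts with a positive object count, which can never
        # begin with the byte 'O', so the magic here means a new file.
        while buf.tell() < len(blob) and blob[buf.tell():buf.tell() + 4] != MAGIC:
            _read_long(buf)                          # number of objects in the block
            buf.seek(_read_long(buf), io.SEEK_CUR)   # skip the serialized objects
            if buf.read(SYNC_SIZE) != sync:
                raise ValueError("sync marker mismatch")
        parts.append(blob[start:buf.tell()])
    return parts
```

Each returned part should then be readable on its own with an ordinary Avro reader, for example `fastavro.reader(io.BytesIO(part))` if the fastavro package is available.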

Chris Miller asked Oct 19 '22 21:10



1 Answer

It looks like Firehose doesn't currently provide any support for your use case, but it's doable with a regular Kinesis stream.

Instead of sending your data to Firehose, send it to a Kinesis stream and define your own AWS Lambda function with the stream as its event source. The function reads the data from the stream and writes it to S3 as an Avro file. You won't run into the problem you had with Firehose, because you already know the data is in Avro format (and you probably own the schema), so it's up to you to decode and re-encode it properly and write the file to S3 in one go.
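A minimal sketch of such a Lambda function, under several assumptions: each Kinesis record carries one complete Avro container file, all records share the same schema, the third-party fastavro package is bundled with the function, and the bucket name comes from a hypothetical `OUTPUT_BUCKET` environment variable. The key naming in `object_key` is also made up for illustration.

```python
import base64
import io
import os


def decode_payloads(event):
    """Return the raw bytes of every Kinesis record in a Lambda event."""
    # Lambda delivers Kinesis record data base64-encoded.
    return [base64.b64decode(r["kinesis"]["data"]) for r in event.get("Records", [])]


def object_key(event, prefix="avro/"):
    """Build an S3 key from the first record's sequence number (hypothetical scheme)."""
    first = event["Records"][0]["kinesis"]["sequenceNumber"]
    return f"{prefix}{first}.avro"


def handler(event, context):
    # Third-party imports are kept local so the helpers above work without them.
    import boto3
    import fastavro

    payloads = decode_payloads(event)
    if not payloads:
        return {"written": 0}

    records, schema = [], None
    for payload in payloads:
        reader = fastavro.reader(io.BytesIO(payload))
        # Assumption: every payload was written with the same schema.
        schema = schema or reader.writer_schema
        records.extend(reader)

    # Re-encode everything as a single, well-formed Avro container file.
    out = io.BytesIO()
    fastavro.writer(out, schema, records)

    boto3.client("s3").put_object(
        Bucket=os.environ["OUTPUT_BUCKET"],  # hypothetical environment variable
        Key=object_key(event),
        Body=out.getvalue(),
    )
    return {"written": len(records)}
```

Because the function decodes every record and writes one new container file per invocation, each S3 object has a single header and can be read with the normal Avro tools.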

alexanderlz answered Oct 22 '22 07:10
