Concatenate s3 files when using AWS Firehose

I have an AWS Kinesis Firehose stream putting data in s3 with the following config:

S3 buffer size (MB)*       2
S3 buffer interval (sec)*  60

Everything works fine. The only problem is that Firehose creates one s3 file for every chunk of data. (In my case, one file every minute, as in the screenshot). Over time, this is a lot of files: 1440 files per day, 525k files per year.

[Screenshot: bucket listing showing one Firehose object delivered per minute]

This is hard to manage (for example, copying the bucket to another one means copying every single file one by one, which takes time).

Two questions:

  • Is there a way to tell Kinesis to group/concatenate old files together? (E.g., files older than 24 hours get grouped into chunks of one day.)
  • How is Redshift COPY performance affected when COPYing from a plethora of S3 files versus just a few? I haven't measured this precisely, but in my experience performance with a lot of small files is noticeably worse. From what I can recall, with big files a COPY of about 2M rows takes about one minute; with lots of small files (~11k files), the same 2M rows take up to 30 minutes. (See the sketch after this list.)
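
For reference, a minimal sketch of the kind of COPY being timed above, run from Python with psycopg2; the bucket, prefix, table, IAM role, and connection details are all hypothetical placeholders. Pointing COPY at a prefix loads every object beneath it in parallel across slices, but each object still carries fixed per-file overhead, which is consistent with the slowdown described in the second question.

    # Hypothetical prefix-based COPY from S3 into Redshift, run via psycopg2.
    import psycopg2

    COPY_SQL = """
        COPY events
        FROM 's3://my-firehose-bucket/2016/04/28/'  -- prefix: loads every object under it
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        # Redshift spreads the objects under the prefix across slices, but each
        # object still adds per-file overhead, so thousands of tiny files load
        # much slower than a few large ones holding the same rows.
        cur.execute(COPY_SQL)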

My two main concerns are:

  • Better redshift COPY performance (from s3)
  • Easier overall s3 file management (backup, manipulation of any kind)
Benjamin Crouzier, asked Apr 28 '16


People also ask

Can Kinesis firehose read from S3?

Yes, Kinesis Data Firehose can back up all un-transformed records to your S3 bucket concurrently while delivering transformed records to destination. Source record backup can be enabled when you create or update your delivery stream.

What is dynamic partitioning in Firehose?

Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id ) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
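
For illustration, a minimal boto3 sketch of what enabling dynamic partitioning on an S3 destination can look like; the stream, bucket, and role names are hypothetical, and the JQ expression that extracts customer_id is just an example.

    # Hypothetical delivery stream with dynamic partitioning on customer_id.
    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="events-partitioned",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
            "DynamicPartitioningConfiguration": {"Enabled": True},
            # The prefix references the key produced by the MetadataExtraction
            # processor below (a JQ expression applied to each JSON record).
            "Prefix": "customer_id=!{partitionKeyFromQuery:customer_id}/",
            "ErrorOutputPrefix": "errors/",
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [{
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": "{customer_id: .customer_id}"},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                }],
            },
        },
    )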

What is the difference between Kinesis data streams and Firehose?

Kinesis Data Streams is a low-latency streaming service with the capacity to ingest data at scale. Kinesis Firehose, on the other hand, serves as a data transfer service: its primary purpose is loading streaming data into Amazon S3, Splunk, Elasticsearch, and Redshift.

Which data formats are supported for Firehose?

Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON.
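
For illustration, a minimal boto3 sketch of turning on that JSON-to-Parquet conversion for an S3 destination; Firehose reads the output schema from a Glue table, and every name below (stream, bucket, role, Glue database and table) is a hypothetical placeholder.

    # Hypothetical delivery stream that converts incoming JSON to Parquet.
    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="events-parquet",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
            # Format conversion requires a larger buffer than the default.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
            "DataFormatConversionConfiguration": {
                "Enabled": True,
                "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
                "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
                "SchemaConfiguration": {
                    "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
                    "DatabaseName": "analytics",   # Glue database holding the schema
                    "TableName": "events",
                    "Region": "us-east-1",
                },
            },
        },
    )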


2 Answers

The easiest fix for you is going to be to increase the Firehose buffer size and buffer interval: you can go up to 15 minutes, which will cut your 1440 files per day down to 96 a day (unless you hit the buffer size limit first, of course).
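
A minimal boto3 sketch of that change, assuming an Extended S3 destination; the stream name is a hypothetical placeholder, and the version and destination IDs are read back from describe_delivery_stream.

    # Raise the S3 buffering hints on an existing delivery stream.
    import boto3

    firehose = boto3.client("firehose")

    desc = firehose.describe_delivery_stream(DeliveryStreamName="my-stream")
    stream = desc["DeliveryStreamDescription"]

    firehose.update_destination(
        DeliveryStreamName="my-stream",
        CurrentDeliveryStreamVersionId=stream["VersionId"],
        DestinationId=stream["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            # 128 MB / 900 seconds are the maximums; whichever fills first
            # triggers delivery of a file.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        },
    )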

Beyond that, there is nothing in Kinesis that will concatenate the files for you, but you could set up an S3 event notification that fires each time Firehose writes a new file and run some of your own code (on EC2, or serverless with Lambda) to do the concatenation yourself.
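
A minimal sketch of the do-it-yourself concatenation, written as a scheduled job (a daily Lambda or a cron task on EC2) rather than a per-file trigger; the bucket name and prefixes are hypothetical, and a production version would stream or multipart-upload rather than hold a whole day of data in memory.

    # Merge one day's worth of small Firehose objects into a single object.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-firehose-bucket"  # hypothetical

    def concatenate_prefix(day_prefix: str) -> None:
        """Merge every object under e.g. '2016/04/28/' into 'daily/2016-04-28'."""
        paginator = s3.get_paginator("list_objects_v2")

        parts = []
        for page in paginator.paginate(Bucket=BUCKET, Prefix=day_prefix):
            for obj in page.get("Contents", []):
                parts.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())

        merged_key = "daily/" + day_prefix.rstrip("/").replace("/", "-")
        s3.put_object(Bucket=BUCKET, Key=merged_key, Body=b"".join(parts))

        # Only after the merged object is safely written, delete the originals.
        for page in paginator.paginate(Bucket=BUCKET, Prefix=day_prefix):
            keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
            if keys:
                s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys})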

Can't comment on the Redshift loading performance, but I suspect it's not a huge deal. If it is, or becomes one, I expect AWS will do something about it, since this is the usage pattern they set up.

E.J. Brennan, answered Sep 27 '22


Kinesis Firehose is designed to allow near-real-time processing of events. It is optimized for such use cases, which is why the settings lean toward smaller, more frequent files. This way you get the data into Redshift for querying sooner, or trigger more frequent Lambda invocations on the smaller files.

It is very common for customers of the service to also prepare the data for longer historical queries. Even though it is possible to run these long-term queries on Redshift, it might make sense to use EMR for them. You can then keep your Redshift cluster tuned for the more popular recent events (for example, a "hot" cluster for 3 months on SSD, and a "cold" cluster for 1 year on HDD).

It makes sense to take the smaller (uncompressed?) files from the Firehose output S3 bucket and convert them into a format better suited to EMR (Hadoop/Spark/Presto). You can use a tool such as S3DistCp to take the small files and concatenate them into larger ones (and a Spark or Hive job if you also want to convert them to a columnar format such as Parquet).
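
A minimal boto3 sketch of submitting such an S3DistCp step to an existing EMR cluster; the cluster ID and bucket names are hypothetical, and note that --groupBy/--targetSize only concatenate and recompress the files, they do not by themselves produce Parquet.

    # Concatenate small Firehose files into ~1 GB objects with S3DistCp on EMR.
    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "concatenate-firehose-output",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-firehose-bucket/2016/04/28/",
                    "--dest", "s3://my-analytics-bucket/daily/2016-04-28/",
                    # Files whose paths share this capture group are concatenated.
                    "--groupBy", ".*(2016/04/28).*",
                    "--targetSize", "1024",  # target output size in MiB
                    "--outputCodec", "gz",
                ],
            },
        }],
    )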

Regarding the optimization of the Redshift COPY, there is a balance between how long you aggregate the events and how long it takes to COPY them. It is true that it is better to have larger files when you copy to Redshift, as there is a small overhead for each file. But on the other hand, if you are COPYing the data only every 15 minutes, you might have "quiet" periods during which you are not utilizing the network or the cluster's ability to ingest events between COPY commands. You should find the balance that is good for the business (how fresh do you need your events to be) and the technical aspects (how many events you can ingest per hour/day into your Redshift cluster).

Guy, answered Sep 27 '22