Concatenate s3 files when using AWS Firehose

I have an AWS Kinesis Firehose stream putting data in s3 with the following config:

S3 buffer size (MB)*       2
S3 buffer interval (sec)*  60

Everything works fine. The only problem is that Firehose creates one s3 file for every chunk of data. (In my case, one file every minute, as in the screenshot). Over time, this is a lot of files: 1440 files per day, 525k files per year.

[Screenshot: bucket listing showing one Firehose object delivered per minute]

This is hard to manage (for example, copying the bucket to another one means copying every single file one by one, which takes time).

Two questions:

  • Is there a way to tell Kinesis to group/concatenate old files together? (E.g., files older than 24 hours get grouped into chunks of one day.)
  • How is Redshift COPY performance affected when COPYing from a plethora of S3 files versus just a few? I haven't measured this precisely, but in my experience performance with a lot of small files is noticeably worse. From what I can recall, with big files a COPY of about 2M rows takes about one minute; with lots of small files (~11k files), the same 2M rows take up to 30 minutes. (See the sketch after this list.)
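
For reference, a minimal sketch of the kind of COPY being timed above, run from Python with psycopg2; the bucket, prefix, table, IAM role, and connection details are all hypothetical placeholders. Pointing COPY at a prefix loads every object beneath it in parallel across slices, but each object still carries fixed per-file overhead, which is consistent with the slowdown described in the second question.

    # Hypothetical prefix-based COPY from S3 into Redshift, run via psycopg2.
    import psycopg2

    COPY_SQL = """
        COPY events
        FROM 's3://my-firehose-bucket/2016/04/28/'  -- prefix: loads every object under it
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        # Redshift spreads the objects under the prefix across slices, but each
        # object still adds per-file overhead, so thousands of tiny files load
        # much slower than a few large ones holding the same rows.
        cur.execute(COPY_SQL)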

My two main concerns are:

  • Better redshift COPY performance (from s3)
  • Easier overall s3 file management (backup, manipulation of any kind)
Benjamin Crouzier, asked Apr 28 '16


People also ask

Can Kinesis firehose read from S3?

Yes, Kinesis Data Firehose can back up all un-transformed records to your S3 bucket concurrently while delivering transformed records to destination. Source record backup can be enabled when you create or update your delivery stream.

What is dynamic partitioning in Firehose?

Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id ) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
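
For illustration, a minimal boto3 sketch of what enabling dynamic partitioning on an S3 destination can look like; the stream, bucket, and role names are hypothetical, and the JQ expression that extracts customer_id is just an example.

    # Hypothetical delivery stream with dynamic partitioning on customer_id.
    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="events-partitioned",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
            "DynamicPartitioningConfiguration": {"Enabled": True},
            # The prefix references the key produced by the MetadataExtraction
            # processor below (a JQ expression applied to each JSON record).
            "Prefix": "customer_id=!{partitionKeyFromQuery:customer_id}/",
            "ErrorOutputPrefix": "errors/",
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [{
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": "{customer_id: .customer_id}"},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                }],
            },
        },
    )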

What is the difference between Kinesis data streams and Firehose?

Kinesis Data Streams is a low-latency streaming service with the capacity to ingest data at scale. Kinesis Firehose, on the other hand, serves as a data transfer service: its primary purpose is loading streaming data into Amazon S3, Splunk, Elasticsearch, and Redshift.

Which data formats are supported for Firehose?

Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON.
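
For illustration, a minimal boto3 sketch of turning on that JSON-to-Parquet conversion for an S3 destination; Firehose reads the output schema from a Glue table, and every name below (stream, bucket, role, Glue database and table) is a hypothetical placeholder.

    # Hypothetical delivery stream that converts incoming JSON to Parquet.
    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="events-parquet",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
            # Format conversion requires a larger buffer than the default.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
            "DataFormatConversionConfiguration": {
                "Enabled": True,
                "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
                "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
                "SchemaConfiguration": {
                    "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
                    "DatabaseName": "analytics",   # Glue database holding the schema
                    "TableName": "events",
                    "Region": "us-east-1",
                },
            },
        },
    )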


2 Answers

The easiest fix for you is going to be to increase the Firehose buffer size and buffer interval: you can go up to 15 minutes, which will cut your 1440 files per day down to 96 a day (unless you hit the buffer size limit first, of course).
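
A minimal boto3 sketch of that change, assuming an Extended S3 destination; the stream name is a hypothetical placeholder, and the version and destination IDs are read back from describe_delivery_stream.

    # Raise the S3 buffering hints on an existing delivery stream.
    import boto3

    firehose = boto3.client("firehose")

    desc = firehose.describe_delivery_stream(DeliveryStreamName="my-stream")
    stream = desc["DeliveryStreamDescription"]

    firehose.update_destination(
        DeliveryStreamName="my-stream",
        CurrentDeliveryStreamVersionId=stream["VersionId"],
        DestinationId=stream["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            # 128 MB / 900 seconds are the maximums; whichever fills first
            # triggers delivery of a file.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        },
    )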

Beyond that, there is nothing in Kinesis that will concatenate the files for you, but you could set up an S3 event notification that fires each time Firehose writes a new file and run some of your own code (on EC2, or serverless with Lambda) to do the concatenation yourself.
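
A minimal sketch of the do-it-yourself concatenation, written as a scheduled job (a daily Lambda or a cron task on EC2) rather than a per-file trigger; the bucket name and prefixes are hypothetical, and a production version would stream or multipart-upload rather than hold a whole day of data in memory.

    # Merge one day's worth of small Firehose objects into a single object.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-firehose-bucket"  # hypothetical

    def concatenate_prefix(day_prefix: str) -> None:
        """Merge every object under e.g. '2016/04/28/' into 'daily/2016-04-28'."""
        paginator = s3.get_paginator("list_objects_v2")

        parts = []
        for page in paginator.paginate(Bucket=BUCKET, Prefix=day_prefix):
            for obj in page.get("Contents", []):
                parts.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())

        merged_key = "daily/" + day_prefix.rstrip("/").replace("/", "-")
        s3.put_object(Bucket=BUCKET, Key=merged_key, Body=b"".join(parts))

        # Only after the merged object is safely written, delete the originals.
        for page in paginator.paginate(Bucket=BUCKET, Prefix=day_prefix):
            keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
            if keys:
                s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys})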

Can't comment on the Redshift loading performance, but I suspect it's not a huge deal. If it is, or becomes one, I expect AWS will do something about it, since this is the usage pattern they set up.

E.J. Brennan, answered Sep 27 '22


Kinesis Firehose is designed to allow near-real-time processing of events. It is optimized for such use cases, which is why the settings lean toward smaller, more frequent files. This way you get the data into Redshift for querying sooner, or trigger more frequent Lambda invocations on the smaller files.

It is very common for customers of the service to also prepare the data for longer historical queries. Even though it is possible to run these long-term queries on Redshift, it might make sense to use EMR for them. You can then keep your Redshift cluster tuned for the more popular recent events (for example, a "hot" cluster for 3 months on SSD, and a "cold" cluster for 1 year on HDD).

It makes sense to take the smaller (uncompressed?) files from the Firehose output S3 bucket and convert them into a format better suited to EMR (Hadoop/Spark/Presto). You can use a tool such as S3DistCp to take the small files and concatenate them into larger ones (and a Spark or Hive job if you also want to convert them to a columnar format such as Parquet).
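
A minimal boto3 sketch of submitting such an S3DistCp step to an existing EMR cluster; the cluster ID and bucket names are hypothetical, and note that --groupBy/--targetSize only concatenate and recompress the files, they do not by themselves produce Parquet.

    # Concatenate small Firehose files into ~1 GB objects with S3DistCp on EMR.
    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "concatenate-firehose-output",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-firehose-bucket/2016/04/28/",
                    "--dest", "s3://my-analytics-bucket/daily/2016-04-28/",
                    # Files whose paths share this capture group are concatenated.
                    "--groupBy", ".*(2016/04/28).*",
                    "--targetSize", "1024",  # target output size in MiB
                    "--outputCodec", "gz",
                ],
            },
        }],
    )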

Regarding the optimization of the Redshift COPY, there is a balance between how long you aggregate the events and how long it takes to COPY them. It is true that it is better to have larger files when you copy to Redshift, as there is a small overhead for each file. But on the other hand, if you are COPYing the data only every 15 minutes, you might have "quiet" periods during which you are not utilizing the network or the cluster's ability to ingest events between COPY commands. You should find the balance that is good for the business (how fresh do you need your events to be) and the technical aspects (how many events you can ingest per hour/day into your Redshift cluster).

Guy, answered Sep 27 '22