Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150 KB each) being added every day, and I need to store them for a long time. Hot data (the most recent 70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.

This means each day I'm storing an additional 200-300 GB worth of files.

Now, S3 Standard storage costs $0.023 per GB per month, while Glacier costs $0.004 per GB per month.

While Glacier is cheap per GB, it carries additional per-request costs, so it would be a bad idea to dump 2 million individual files a day into Glacier:

PUT requests to Glacier $0.05 per 1,000 requests

Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
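
To make the request overhead concrete, here is a rough back-of-the-envelope comparison (my own arithmetic, using only the figures quoted above; the 250 GB midpoint is an assumption):

    # Back-of-the-envelope: one day's 2 million files pushed to Glacier individually.
    files_per_day = 2_000_000
    put_cost = files_per_day / 1_000 * 0.05   # $0.05 per 1,000 PUT requests
    gb_per_day = 250                          # assumed midpoint of 200-300 GB/day
    monthly_storage = gb_per_day * 0.004      # $/month to keep that day's data in Glacier
    print(f"one-off PUT cost:  ${put_cost:,.2f}")        # -> $100.00
    print(f"storage per month: ${monthly_storage:,.2f}") # -> $1.00

So each day's upload would cost about $100 in requests alone, roughly a hundred times the monthly storage bill for that same data.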

Is there a way of gluing the files together, but keeping them accessible individually?

asked Sep 27 '19 by Buffalo

People also ask

Is S3 good for small files?

Small files create too much latency for data analytics. Since streaming data comes in small files, you typically write these files to S3 rather than combining them on write. But small files impede performance, and this is true regardless of whether you're working with Hadoop or Spark, in the cloud or on-premises.

How many files can an S3 hold?

Q: How much data can I store in Amazon S3? The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB.

What is S3 file storage?

What is Amazon S3? Amazon Simple Storage Service (Amazon S3) is a scalable, high-speed, web-based cloud storage service. The service is designed for online backup and archiving of data and applications on Amazon Web Services (AWS).

Can files be stored in S3?

Amazon S3 is a service that enables you to store your data (referred to as objects) at massive scale. In this guide, you will create an Amazon S3 bucket (a container for data stored in S3), upload a file, retrieve the file, and delete the file.
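
For reference, those four steps (create a bucket, upload, retrieve, delete) map directly onto the AWS SDK. A minimal Python sketch using boto3, assuming credentials are already configured and the us-east-1 region; the bucket and file names here are made up:

    import boto3

    s3 = boto3.client("s3")  # assumes credentials are configured, e.g. via ~/.aws

    # Create a bucket (us-east-1; other regions need CreateBucketConfiguration).
    s3.create_bucket(Bucket="my-example-bucket-1234")

    # Upload a local file, fetch it back, then delete the object.
    s3.upload_file("page.html.zip", "my-example-bucket-1234", "page.html.zip")
    s3.download_file("my-example-bucket-1234", "page.html.zip", "page-copy.html.zip")
    s3.delete_object(Bucket="my-example-bucket-1234", Key="page.html.zip")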


1 Answer

One important point: if you need quick access to these files, Glacier can take up to 12 hours to return an object. So the best you can do is use S3 Standard – Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and perhaps Glacier for data that is genuinely almost never used. It all depends on how fast you need that data.

Given that, I'd suggest the following:

  • as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly, or monthly); compressed together, they can achieve an even better ratio;
  • keep an index file or database that records which archive each HTML file is stored in;
  • read only the desired HTML files from an archive without unpacking the whole zip file; see the Python sketch below for one way to implement that.
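
A minimal sketch of the last two points, using Python's standard-library zipfile module (the archive name, index path, and member names are illustrative assumptions, not part of the original answer):

    import json
    import zipfile

    def build_index(archive_paths, index_path="index.json"):
        """Record, in a simple JSON index, which archive holds each file."""
        index = {}
        for archive in archive_paths:
            with zipfile.ZipFile(archive) as zf:
                for name in zf.namelist():
                    index[name] = archive
        with open(index_path, "w") as fh:
            json.dump(index, fh)
        return index

    def read_file(name, index):
        """Read one member without unpacking the rest of the archive.

        ZipFile.open() locates the member through the zip's central
        directory and decompresses only that member's bytes.
        """
        with zipfile.ZipFile(index[name]) as zf:
            with zf.open(name) as member:
                return member.read()

    if __name__ == "__main__":
        idx = build_index(["archive-2019-09-27.zip"])  # hypothetical daily archive
        html = read_file("page-000001.html", idx)      # hypothetical member name
        print(len(html), "bytes")

Note that zipfile needs a seekable file object, so when the archive lives on S3 you would either download it once or wrap it in a seekable file-like object backed by ranged GETs (the s3fs library provides one, for example).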
answered Sep 19 '22 by wowkin2