Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150 KB each) being added every day, and I need to store them for a long time. Hot data (the most recent 70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.

This means each day I'm storing an additional 200-300 GB worth of files.

Now, S3 Standard storage costs $0.023 per GB per month, while Glacier costs $0.004 per GB per month.

While Glacier is cheap per GB, it carries additional per-request costs, so it would be a bad idea to dump 2 million individual files a day into Glacier:

PUT requests to Glacier $0.05 per 1,000 requests

Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
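
To make the request overhead concrete, here is a rough back-of-the-envelope comparison (my own arithmetic, using only the figures quoted above; the 250 GB midpoint is an assumption):

    # Back-of-the-envelope: one day's 2 million files pushed to Glacier individually.
    files_per_day = 2_000_000
    put_cost = files_per_day / 1_000 * 0.05   # $0.05 per 1,000 PUT requests
    gb_per_day = 250                          # assumed midpoint of 200-300 GB/day
    monthly_storage = gb_per_day * 0.004      # $/month to keep that day's data in Glacier
    print(f"one-off PUT cost:  ${put_cost:,.2f}")        # -> $100.00
    print(f"storage per month: ${monthly_storage:,.2f}") # -> $1.00

So each day's upload would cost about $100 in requests alone, roughly a hundred times the monthly storage bill for that same data.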

Is there a way of gluing the files together, but keeping them accessible individually?

asked Sep 27 '19 by Buffalo

People also ask

Is S3 good for small files?

Small files create too much latency for data analytics. Since streaming data comes in small files, you typically write these files to S3 rather than combining them on write. But small files impede performance, and this is true regardless of whether you're working with Hadoop or Spark, in the cloud or on-premises.

How many files can an S3 hold?

Q: How much data can I store in Amazon S3? The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB.

What is S3 file storage?

What is Amazon S3? Amazon Simple Storage Service (Amazon S3) is a scalable, high-speed, web-based cloud storage service. The service is designed for online backup and archiving of data and applications on Amazon Web Services (AWS).

Can files be stored in S3?

Amazon S3 is a service that enables you to store your data (referred to as objects) at massive scale. In this guide, you will create an Amazon S3 bucket (a container for data stored in S3), upload a file, retrieve the file, and delete the file.
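
For reference, those four steps (create a bucket, upload, retrieve, delete) map directly onto the AWS SDK. A minimal Python sketch using boto3, assuming credentials are already configured and the us-east-1 region; the bucket and file names here are made up:

    import boto3

    s3 = boto3.client("s3")  # assumes credentials are configured, e.g. via ~/.aws

    # Create a bucket (us-east-1; other regions need CreateBucketConfiguration).
    s3.create_bucket(Bucket="my-example-bucket-1234")

    # Upload a local file, fetch it back, then delete the object.
    s3.upload_file("page.html.zip", "my-example-bucket-1234", "page.html.zip")
    s3.download_file("my-example-bucket-1234", "page.html.zip", "page-copy.html.zip")
    s3.delete_object(Bucket="my-example-bucket-1234", Key="page.html.zip")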


1 Answer

One important point: if you need quick access to these files, Glacier can take up to 12 hours to return an object. So the best you can do is use S3 Standard – Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and perhaps Glacier for data that is genuinely almost never used. It all depends on how fast you need that data.

Given that, I'd suggest the following:

  • as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly, or monthly); compressed together, they can achieve an even better ratio;
  • keep an index file or database that records which archive each HTML file is stored in;
  • read only the desired HTML files from an archive without unpacking the whole zip file; see the Python sketch below for one way to implement that.
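
A minimal sketch of the last two points, using Python's standard-library zipfile module (the archive name, index path, and member names are illustrative assumptions, not part of the original answer):

    import json
    import zipfile

    def build_index(archive_paths, index_path="index.json"):
        """Record, in a simple JSON index, which archive holds each file."""
        index = {}
        for archive in archive_paths:
            with zipfile.ZipFile(archive) as zf:
                for name in zf.namelist():
                    index[name] = archive
        with open(index_path, "w") as fh:
            json.dump(index, fh)
        return index

    def read_file(name, index):
        """Read one member without unpacking the rest of the archive.

        ZipFile.open() locates the member through the zip's central
        directory and decompresses only that member's bytes.
        """
        with zipfile.ZipFile(index[name]) as zf:
            with zf.open(name) as member:
                return member.read()

    if __name__ == "__main__":
        idx = build_index(["archive-2019-09-27.zip"])  # hypothetical daily archive
        html = read_file("page-000001.html", idx)      # hypothetical member name
        print(len(html), "bytes")

Note that zipfile needs a seekable file object, so when the archive lives on S3 you would either download it once or wrap it in a seekable file-like object backed by ranged GETs (the s3fs library provides one, for example).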
answered Sep 19 '22 by wowkin2