
Efficient way of storing many thumbnails

Currently I am storing all thumbnails in a single directory, with the file name being the MD5 hash of the full path to the full-size image. But I've read here that this causes issues once the directory reaches thousands of files: lookups on the Linux file system get slower and slower.

What alternatives do I have, considering I can only locate the thumbnail by the original image path? Dates would be the best option, like year/month/day/md5_hash.jpg, but that would require me to store and read the date from somewhere, which would add some extra steps.

I was thinking of splitting the MD5: first two characters = subfolder name, rest = file name. That would give me 16*16 = 256 subfolders, but I'd like to hear better options, thanks!
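
For illustration, that split would look roughly like this (a Python sketch; the thumbnail root and the .jpg extension are just placeholders):

    import hashlib
    import os

    def thumb_path(original_path: str, root: str = "/var/thumbs") -> str:
        """Map an original image path to its sharded thumbnail location.

        The first two hex characters of the MD5 pick the subfolder
        (16 * 16 = 256 buckets); the rest becomes the file name.
        """
        digest = hashlib.md5(original_path.encode("utf-8")).hexdigest()
        return os.path.join(root, digest[:2], digest[2:] + ".jpg")

    # e.g. /var/thumbs/3f/7a9c...jpg -- deterministic, no date lookup needed
    print(thumb_path("/photos/2020/07/beach.jpg"))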


Another idea I just had: create a separate server for organizing thumbnails. The server would keep track of thumbnail counts, create additional folders when a certain limit is reached, and reuse old folders when thumbs are removed. The downside is that I would need a separate DB that maps hashes to thumbnail paths :(
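
For illustration, that mapping could be as small as a single table keyed on the hash (a rough sqlite sketch; the table and column names are made up):

    import sqlite3

    con = sqlite3.connect("thumbs.db")  # hypothetical mapping database
    con.execute(
        """CREATE TABLE IF NOT EXISTS thumbnails (
               hash TEXT PRIMARY KEY,  -- MD5 of the original image's full path
               path TEXT NOT NULL      -- where the thumbnail file actually lives
           )"""
    )

    def lookup(md5_hash):
        """Return the stored thumbnail path for a hash, or None if unknown."""
        row = con.execute(
            "SELECT path FROM thumbnails WHERE hash = ?", (md5_hash,)
        ).fetchone()
        return row[0] if row else None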

Alex asked Jul 08 '20


2 Answers

The best, most efficient, minimal and simplest method is SeaweedFS.

Since 2017 I have been using SeaweedFS to store about 4 million JPEGs every 24 hours. It currently holds over 2 billion records. I have never had an issue with it, and it saves a lot of disk space compared to storing the images as individual files on the file system.

Below is the author's intro:

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

  1. to store billions of files!
  2. to serve the files fast!
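
For reference, the basic write/read flow against SeaweedFS's HTTP API looks roughly like this (a sketch, not my production code; it assumes a default master on localhost:9333 and a local volume server):

    import requests  # third-party HTTP client: pip install requests

    MASTER = "http://localhost:9333"  # assumed default SeaweedFS master address

    # 1. Ask the master to assign a file id (fid) and a volume server.
    assign = requests.get(f"{MASTER}/dir/assign").json()
    fid, volume = assign["fid"], assign["url"]

    # 2. Upload the thumbnail to the assigned volume server under that fid.
    with open("thumb.jpg", "rb") as f:
        requests.post(f"http://{volume}/{fid}", files={"file": f})

    # 3. Keep the fid with your image record; fetch the thumbnail back with a GET.
    thumb_bytes = requests.get(f"http://{volume}/{fid}").content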

Details:

My project stores two images for each event: one thumbnail and one full frame. In the first phase of the project I stored the images as files with the directory structure year/month/day/[thumb|full].jpg, but after a few days I had to browse through the files and it was a nightmare; the disk response was slow, and deleting a large number of files (over a million) would take hours. So I decided to research how the big players such as Google, Facebook, Instagram and Twitter store billions of images. I found a couple of YouTube videos explaining parts of their architectures, then I came across SeaweedFS and gave it a try. I took a quick look at the source code (release ver 0.76) and everything seemed fine, no fishy code; the only note was that the logo is fetched over a CDN rather than locally.

The beauty of SeaweedFS lies in its simplicity and stability; it's kind of a hidden gem (I guess until now). Besides its ability to store billions of files and access them within milliseconds, it auto-purges files based on a TTL, which is a very useful feature since most customers have a finite amount of storage and can't keep all the data forever. The second thing I love about it is how much storage it saves. Example:

On my server, each file consumed a multiple of 8 KB of disk space (due to the file system's block structure), so even though most of my thumbnails were only 1 or 2 KB, each one consumed 8 KB. When you add up all those wasted bytes, you end up wasting a large percentage of your storage. In SeaweedFS, each file's metadata takes only an extra 40 bytes, and that's next to nothing!
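
Rough back-of-the-envelope arithmetic on those numbers (my own illustration, using the figures above):

    files = 2_000_000_000      # roughly the record count mentioned above
    avg_thumb = 2 * 1024       # ~2 KB actual thumbnail size
    fs_block = 8 * 1024        # space each file occupies on the block file system
    weed_meta = 40             # extra bytes SeaweedFS keeps per file

    wasted_on_fs = files * (fs_block - avg_thumb)  # padding lost to block rounding
    weed_overhead = files * weed_meta              # SeaweedFS per-file overhead

    print(f"file-system padding waste: {wasted_on_fs / 1024**4:.1f} TiB")   # ~11.2 TiB
    print(f"SeaweedFS metadata:        {weed_overhead / 1024**3:.1f} GiB")  # ~74.5 GiB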

Hope that helps.

Jawad Al Shaikh answered Nov 09 '22


We use FreeBSD (file system UFS), not Linux, so some details may be different.

Background

We have several million files on this system that need to be served as quickly as possible from a website, for individual access. The system we have been using has worked very well over the last 16 years.

Server 1 (named: Tom) hosts the main user website with a fairly standard Apache set-up and a MySQL database. Nothing special at all.

Server 2 (named: Jerry) is where the user files are stored and has been customised for speedy delivery of these small files.

Jerry's hard drive is tweaked during creation to make sure we do not run out of inodes - something you need to consider when creating millions of small files.

Jerry's Apache config is tweaked for very short connection times and single file access per connection. Without these tweaks, you will have open connections sitting there wasting resources. This Apache config would not suit the main system (Tom) at all and would cause a number of issues.

As you are serving "thumbnails", not individual requests, you might need a slightly different structure. To be honest, I do not know enough about your needs to really advise what would be best for your webserver config.

Historically, we used multiple SCSI drives across a number of servers. At the moment, we have a single server with 300MB/s drives. The business has been in decline for a while (thanks to Facebook), but we are still doing more than 2 million file requests per day. At our peak it was more like 10 million per day.

Our structure (a possible answer)

Everything on Jerry is tweaked for the small file delivery and nothing else.

Jerry is a webserver, but we treat it more like a database. Everything that is not needed is removed.

Each file is given a 4-character ID. The ID is alphanumeric (0-9, a-z, A-Z). This gives you 62*62*62*62 combinations (or 14,776,336 IDs).

We have multiple domains as well, so each domain has a maximum of 14,776,336 IDs. On the popular domains we got very close to this limit before Facebook came along; we had plans ready to go that would allow for 5-character IDs, but did not need them in the end.

File system look-ups are very fast if you know the full path to the file. It is only slow if you need to scan for file matches. We took full advantage of this.

Each 4-character ID is a series of directories. For example, aBc9 is /path/to/a/B/c/9.

This gives a very high number of unique IDs across only 4 directory levels. Each directory has a maximum of 62 sub-directories, creating fast look-ups without flooding the file system index.

Located in directory ./9 (the last directory in the ID) are the necessary metadata files and the raw data file. The metadata has a known file name and so does the data file. We also have other known files in each folder, but you get the idea.
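
A rough sketch of that ID-to-path mapping (not our actual code; the alphabet, base path and data file name are placeholders):

    import os
    import random
    import string

    ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars
    BASE = "/path/to"  # base path from the example above

    def new_id() -> str:
        """Generate a random 4-character ID such as 'aBc9'."""
        return "".join(random.choice(ALPHABET) for _ in range(4))

    def id_to_dir(file_id: str) -> str:
        """Map 'aBc9' to '/path/to/a/B/c/9' -- one directory level per character."""
        return os.path.join(BASE, *file_id)

    print(id_to_dir("aBc9"))                            # /path/to/a/B/c/9
    print(os.path.join(id_to_dir("aBc9"), "data.dat"))  # hypothetical known data file name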

If a user is updating or checking the metadata, the ID is known, so the requested metadata is returned.

If the data file is requested, again, the ID is known, so the data is returned. No scanning or complex checking is performed.

If the ID is invalid, an invalid result is returned.

Nothing complex, everything for speed.

Our issues

When you are talking about millions of small files, it is possible to run out of inodes. Be sure to factor this into your disk creation for the server from the start. Plan ahead.

We disabled and / or edited a number of the FreeBSD system checks. The maintenance cronjobs are not designed for systems with so many files.

The Apache configuration was a bit of trial and error to get just right. When you do get it, the relief is huge. Apache's mod_status is very helpful.

The very first thing to do is disable all log files. Next, disable everything and re-add only what you need.

The code for the delivery (and saving) of the metadata and raw data is also very optimised. Forget code libraries. Every line of code has been checked and re-checked over the years for speed.

Conclusion

If you really do have a lot of thumbnails, split the system. Serve the small files from a dedicated server that has been optimised for that purpose. Keep the main system tweaked for more standard usage.

A directory-based ID system (be it random 4 characters or parts of an MD5) can be fast so long as you do not need to scan for files.

Your base operating system will need to be tweaked so the system checks are not sucking up your system resources.

Disable the webserver logfile creation. You are almost never going to need it, and it will create a bottleneck on the file system. If you need stats, you can get a general overview from mod_status.

To be very honest, not enough information is really known about your individual case and needs. I am unsure if any of my personal experience would be of help.

Good luck!

Tigger answered Nov 09 '22