What is the fastest and most efficient way of storing and fetching images when you have millions of users on a LAMP server?

Here is the best method I have come up with so far and I would like to know if there is an even better method (I'm sure there is!) for storing and fetching millions of user images:

In order to keep the directory sizes down and avoid having to make any additional calls to the DB, I am using nested directories that are calculated based on the User's unique ID as follows:

$firstDir = './images';
$secondDir = floor($userID / 100000);
$thirdDir = floor(substr($userID, -5, 5) / 100);
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";

User IDs ($userID) range from 1 into the millions.

So if I have User ID 7654321, for example, that user's first pic will be stored in:

./images/76/543/7654321/1.jpg

For User ID 654321:

./images/6/543/654321/1.jpg

For User ID 54321 it would be:

./images/0/543/54321/1.jpg

For User ID 4321 it would be:

./images/0/43/4321/1.jpg

For User ID 321 it would be:

./images/0/3/321/1.jpg

For User ID 21 it would be:

./images/0/0/21/1.jpg

For User ID 1 it would be:

./images/0/0/1/1.jpg

This ensures that with up to 100,000,000 users, I will never have a directory with more than 1,000 sub-directories, so it seems to keep things clean and efficient.
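For reference, the scheme above can be wrapped in one small function. This is only a sketch: userImagePath() is a made-up name, it assumes PHP 7+ for intdiv(), and it swaps the substr() call for the modulo operator, which should give the same directories for positive integer IDs.

```php
<?php
// Sketch of the nested-directory scheme described above.
// userImagePath() is a hypothetical helper name, not from the original post.
function userImagePath(int $userID, int $picNumber = 1, string $root = './images'): string
{
    // Second level: one directory per block of 100,000 user IDs.
    $secondDir = intdiv($userID, 100000);
    // Third level: one directory per block of 100 IDs inside that block
    // (modulo replaces the substr() call used in the post).
    $thirdDir = intdiv($userID % 100000, 100);

    return "$root/$secondDir/$thirdDir/$userID/$picNumber.jpg";
}

echo userImagePath(7654321), PHP_EOL; // ./images/76/543/7654321/1.jpg
echo userImagePath(4321), PHP_EOL;    // ./images/0/43/4321/1.jpg
echo userImagePath(1), PHP_EOL;       // ./images/0/0/1/1.jpg
```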

I benchmarked this method against the following "hash" method, which uses crc32, the fastest hash function I could find in PHP. It takes the first 3 characters of the hash of the User ID as the Second Directory and the next 3 characters as the Third Directory, in order to distribute the files randomly but evenly:

$hash = crc32($userID);
$firstDir = './images';
$secondDir = substr($hash,0,3);
$thirdDir = substr($hash,3,3);
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";

However, this "hash" method is slower than the method I described earlier above, so it's no good.
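One detail worth checking in the hash variant: crc32() returns an integer, so substr() is slicing its decimal digits. Those values range from 0 to 4,294,967,295, and most of them are ten digits long and so begin with 1, 2, 3, or 4, which means the first-level directories may fill less evenly than intended. The sketch below slices a fixed-width hex string instead. hashedImagePath() is a hypothetical name, and this is one possible variation, not the method that was benchmarked.

```php
<?php
// hashedImagePath() is a hypothetical helper, not from the original post.
function hashedImagePath(int $userID, int $picNumber = 1, string $root = './images'): string
{
    // hash('crc32b', ...) computes the same checksum as crc32(), but returns
    // it as a fixed-width, 8-character lowercase hex string.
    $hash = hash('crc32b', (string) $userID);

    $secondDir = substr($hash, 0, 3); // 16^3 = 4096 possible names per level
    $thirdDir  = substr($hash, 3, 3);

    return "$root/$secondDir/$thirdDir/$userID/$picNumber.jpg";
}

echo hashedImagePath(7654321), PHP_EOL;
```

Three hex characters allow 4,096 names per level; taking two characters instead would cap each level at 256.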

I then went one step further and found an even faster method of calculating the Third Directory in my original example (floor(substr($userID, -5, 5) / 100);) as follows:

$thirdDir = floor(substr($userID, -5, 3));

Now, this changes how/where the first 10,000 User IDs are stored, making some third directories have either 1 user sub-directory or 111 instead of 100, but it has the advantage of being faster since we do not have to divide by 100, so I think it is worth it in the long run.
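To see exactly where the two calculations diverge, they can be run side by side. The function names below are made up for this comparison; the bodies follow the two expressions from the post.

```php
<?php
// Both variants of the third-directory calculation, side by side.
// Function names are hypothetical.
function thirdDirDivide(int $userID): int
{
    // Original: last five digits, divided by 100 and rounded down.
    return (int) floor((int) substr((string) $userID, -5, 5) / 100);
}

function thirdDirSlice(int $userID): int
{
    // Shortcut: first three of the last five digits, no division.
    return (int) substr((string) $userID, -5, 3);
}

foreach ([7654321, 54321, 4321, 321, 21, 1] as $id) {
    printf("%7d  divide=%3d  slice=%3d\n", $id, thirdDirDivide($id), thirdDirSlice($id));
}
```

The two agree for IDs with five or more digits (7654321 and 54321 both give 543). They differ below 10,000: for 4321 the original gives 43 and the shortcut gives 432, and for 21 the original gives 0 and the shortcut gives 21.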

Once the directory structure is defined, here is how I plan on storing the actual individual images: if a user uploads a 2nd pic, for example, it would go in the same directory as their first pic, but it would be named 2.jpg. The default pic of the user would always just be 1.jpg, so if they decide to make their 2nd pic the default pic, 2.jpg would be renamed to 1.jpg and 1.jpg would be renamed 2.jpg.
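The rename step could look roughly like this. makeDefaultPicture() and the swap.tmp file name are assumptions for illustration; the swap takes three renames and is not atomic, so a real implementation would probably want locking or error recovery around it.

```php
<?php
// Sketch: make picture $n the default by swapping it with 1.jpg.
// makeDefaultPicture() is a hypothetical helper; $userDir is the user's
// image directory, e.g. ./images/0/0/1
function makeDefaultPicture(string $userDir, int $n): bool
{
    $default = "$userDir/1.jpg";
    $chosen  = "$userDir/$n.jpg";

    // Nothing to do if $n is already the default or a file is missing.
    if ($n === 1 || !is_file($default) || !is_file($chosen)) {
        return false;
    }

    // Swapping two names needs a temporary third name.
    $tmp = "$userDir/swap.tmp";

    return rename($default, $tmp)
        && rename($chosen, $default)
        && rename($tmp, $chosen);
}
```

Because the file names are reused, any browser or proxy that has cached 1.jpg may keep showing the old default for a while.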

Last but not least, if I needed to store multiple sizes of the same image, I would store them as follows for User ID 1 (for example):

1024px:

./images/0/0/1/1024/1.jpg
./images/0/0/1/1024/2.jpg

640px:

./images/0/0/1/640/1.jpg
./images/0/0/1/640/2.jpg
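The per-size layout above amounts to inserting one more path segment. A minimal sketch, where sizedImagePaths() is a hypothetical helper and the default widths are just the two from the example:

```php
<?php
// sizedImagePaths() is a hypothetical helper, not from the original post.
// $userDir is the user's image directory, e.g. ./images/0/0/1
function sizedImagePaths(string $userDir, int $picNumber, array $widths = [1024, 640]): array
{
    $paths = [];
    foreach ($widths as $width) {
        // One sub-directory per width, same file name in each.
        $paths[$width] = "$userDir/$width/$picNumber.jpg";
    }
    return $paths;
}

print_r(sizedImagePaths('./images/0/0/1', 2));
```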

That's about it.

So, are there any flaws with this method? If so, could you please point them out?

Is there a better method? If so, could you please describe it?

Before I embark on implementing this, I want to make sure I have the best, fastest, and most efficient method for storing and retrieving images so that I don't have to change it again.

Thanks!

asked Jul 29 '11 19:07 by ProgrammerGirl


1 Answer

Don't worry about the small speed differences in calculating the path; they don't matter. What matters is how well and uniformly the images are distributed across the directories, how short the generated path is, and how hard it is to deduce the naming convention (let's replace 1.jpg with 2.jpg... wow, it works).

For example, in your hash solution the path is based entirely on the user ID, which puts all pictures belonging to one user in the same directory.
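One way to make file names hard to guess without an extra DB lookup is to derive them from the user ID and picture number with a keyed hash. This is only a sketch of that idea, not something the answer prescribes: opaqueImageName() and $secret are hypothetical, and the key would have to stay private on the server.

```php
<?php
// Sketch: derive a hard-to-guess file name from the user ID and picture number.
// opaqueImageName() and $secret are hypothetical names for illustration.
function opaqueImageName(int $userID, int $picNumber, string $secret): string
{
    // Without the secret key, the name for "2" cannot be derived from the name for "1".
    $mac = hash_hmac('sha256', "$userID:$picNumber", $secret);

    // Keep the first 16 hex characters (64 bits) to keep the path short.
    return substr($mac, 0, 16) . '.jpg';
}

echo opaqueImageName(7654321, 1, 'example-key-change-me'), PHP_EOL;
```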

Use the whole alphabet (lowercase and uppercase, if your filesystem supports it), not just digits. Look at what other software does; good places to see hashed directory names are Google Chrome and Mozilla, among others. Short directory names are better: they are faster to look up and take up less space in your HTML documents.
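As an illustration of how much shorter names get with the full alphabet, here is a minimal base-62 encoder. base62Encode() is a hypothetical helper, and it assumes a case-sensitive filesystem; on a case-insensitive one, names such as "aB" and "Ab" would collide.

```php
<?php
// base62Encode() is a hypothetical helper, not from the original answer.
function base62Encode(int $n): string
{
    if ($n < 0) {
        throw new InvalidArgumentException('Only non-negative integers are supported.');
    }

    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    if ($n === 0) {
        return '0';
    }

    $out = '';
    while ($n > 0) {
        // Prepend the digit for the current remainder, then move up one place.
        $out = $alphabet[$n % 62] . $out;
        $n = intdiv($n, 62);
    }
    return $out;
}

echo base62Encode(7654321), PHP_EOL; // w7eN
```

Here a seven-digit ID becomes a four-character name.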

answered Sep 29 '22 11:09 by Karoly Horvath