
Storage for millions of images [closed]

I need to prepare storage for hundreds of millions of images (I have 70 million now and the number is still growing). Each image is approx. 20 kB. Of course I can store them in a filesystem, but I'm afraid of running out of inodes. I have tested MongoDB and Cassandra. Both of them have disadvantages (I have limited HDD resources):

  • MongoDB - disk space consumption is 3 times the size of the raw data
  • Cassandra - disk space consumption is similar to the size of the raw data, but Cassandra needs a lot of free space for its compaction procedure
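
To put those figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 kB = 1000 bytes) and takes the 3x MongoDB overhead from the measurement above; the function name and the 300 million growth figure are illustrative assumptions, not measurements.

```python
def estimate_storage_tb(image_count, avg_image_kb=20, overhead_factor=1.0):
    """Estimate disk usage in terabytes (decimal units, 1 TB = 1e12 bytes)."""
    total_bytes = image_count * avg_image_kb * 1000 * overhead_factor
    return total_bytes / 1e12


# Current collection: 70 million images of roughly 20 kB each.
raw_now_tb = estimate_storage_tb(70_000_000)

# Same collection with the observed 3x MongoDB on-disk overhead.
mongo_now_tb = estimate_storage_tb(70_000_000, overhead_factor=3.0)

# Hypothetical growth target, an assumption standing in for
# "hundreds of millions".
raw_future_tb = estimate_storage_tb(300_000_000)

print(f"raw now:    {raw_now_tb:.1f} TB")
print(f"mongo now:  {mongo_now_tb:.1f} TB")
print(f"raw future: {raw_future_tb:.1f} TB")
```

So the raw data is on the order of 1.4 TB today, and the storage overhead factor matters more than the image count alone.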

Can anybody suggest a proper solution for this kind of problem?

Tomasz Mielcarz asked Nov 19 '12 16:11


2 Answers

I have, in my life, done video distribution with both S3 (Rackspace cloudfiles included) and MongoDB.

Most people, without a second glance, would go for S3; however, I have found that both have their downsides. One of the big problems is that S3 is not a CDN. It is redundant storage within a specific region, and by default it is not replicated to other S3 regions. This means you would need something like CloudFront on top of S3 to push your images out to a sort of cache if you were to get serious load on your site.

S3 also has other features which make it less CDN-ish and more of a storage warehouse. That being said, for infrequently accessed files S3 is blazingly fast.

This dual layer of course creates complexities such as maintenance. Not only that, but a CDN works on TTLs, and even though many CDNs nowadays have edge purge abilities, they are still not a 100% sure way of making sure your deleted files are no longer accessible.

So due to the set-up and the accesses (including possible accesses to files that should already have been deleted), this could get quite costly quite quickly.

This is where MongoDB could win. Depending on your scenario, MongoDB could actually be cheaper here, because you could hold your data on a whole bunch of micro instances on AWS, run them as spot instances (dirt cheap), and all you need is a big disk on a single machine.

Hell, you could even use S3 to store the images and then MongoDB as a CloudFront replacement.

When you want to push images to different regions, you just make a few spot instances in the target region and get MongoDB to replicate its data across. You can do some cool stuff with the replication too, to make sure only files frequently accessed from that region are placed in that region.

So I wouldn't throw MongoDB out (or even Cassandra), rather I would do a means test between the two.

Edit

As an added note about S3 pricing: if you store your files in RR (Reduced Redundancy), the price roughly halves, which makes S3 very cheap. However, you still have the problem that S3 is not a CDN.

Further Edit

Since I really only carried on from @cirrus' answer I will actually re-evaluate your question which is kinda answered above.

As an example, YouTube actually stores all of their images on single computers that are then distributed, so they can manage 200m thumbnails and, well, a lot of views each day straight from the file system. So I think your worry about the file system is overrated.
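
If you do go with the plain file system, one common way to keep any single directory from growing huge is to spread files across hashed subdirectories. The following is only a minimal sketch of that general technique, not a description of how YouTube or anyone else mentioned here does it; the function name, the `/data/images` root and the `.jpg` extension are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sharded_path(root, image_id, levels=2, width=2):
    """Map an image ID to a nested path such as root/ab/cd/<id>.jpg.

    The directory names come from the leading hex characters of a SHA-1
    digest of the ID, so files spread roughly evenly across the leaves.
    """
    digest = hashlib.sha1(str(image_id).encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    # The ".jpg" extension is an assumption for this sketch.
    return Path(root).joinpath(*parts, f"{image_id}.jpg")


# Hypothetical root directory and image ID.
path = sharded_path("/data/images", 12345)
print(path)

# Two levels of two hex characters give 256 * 256 = 65,536 leaf directories.
leaf_dirs = 16 ** (2 * 2)
print(200_000_000 // leaf_dirs, "files per directory for 200 million images")
```

With two levels of two hex characters, 200 million images come out at roughly 3,000 files per directory. Note that this only bounds directory size; the total number of inodes the file system provides still has to cover every file.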

As for which database is better...I dunno, that comes down to your testing.

I mean, the answer to your problem depends upon your scenario, your budget, your hardware and your resources, i.e. if you have AWS servers this would be a whole different answer than for dedicated in-house servers.

Sammaye answered Oct 18 '22 19:10


Why don't you stick them in Amazon's S3 or Azure Blob storage? They're a much better fit and you won't have space or memory issues, and you won't have to manage the deployment.

cirrus answered Oct 18 '22 17:10