I need to prepare storage for hundreds of millions of images (I currently have 70 million and the number is still growing). Each image is approx. 20 kB. Of course I could store them in a filesystem, but I'm afraid of the number of inodes. I have tested MongoDB and Cassandra, and both have disadvantages (I have limited HDD resources).
Can anybody suggest a proper solution for this kind of problem?
I have, in my life, done video distribution with both S3 (Rackspace Cloud Files included) and MongoDB.
Most people, without a second glance, would go for S3; however, I have found that both have their downsides. One of the big problems is that S3 is not a CDN: it is actually redundant storage within a specific region that is not replicated to other S3 regions. This means you will need to put something like CloudFront on top of S3 to push your images out to a kind of cache if you were to get serious load on your site.
S3 also has other traits which make it less CDN-ish and more of a storage warehouse. That being said, for infrequently accessed files S3 is blazingly fast.
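For concreteness, here is a minimal sketch (Python with boto3) of that layering: the image bytes go into an S3 bucket, and a CloudFront distribution that is assumed to already use that bucket as its origin serves them. The bucket name, key scheme, and distribution domain below are made up for illustration.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-image-bucket"                    # hypothetical bucket
CDN_DOMAIN = "d111111abcdef8.cloudfront.net"  # hypothetical CloudFront distribution

def store_image(image_id: str, data: bytes) -> str:
    """Upload one ~20 kB image to S3 and return the CDN-facing URL."""
    key = f"images/{image_id}.jpg"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=data,
        ContentType="image/jpeg",
    )
    # S3 holds the object; CloudFront in front of it does the edge caching.
    return f"https://{CDN_DOMAIN}/{key}"
```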
This dual layer of course creates complexity, such as maintenance. Not only that, but a CDN works on TTLs, and even though many CDNs nowadays have edge-purge abilities, they are still not a 100% sure way of making sure your files are no longer accessible.
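If you do need to purge something from the edges, the usual mechanism is an invalidation request. A hedged sketch with boto3 follows (the distribution ID and key are placeholders), and as noted above this still isn't a 100% guarantee that every cached copy is gone.

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

def purge_image(distribution_id: str, key: str) -> None:
    """Ask CloudFront to drop a cached object from its edge locations."""
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": [f"/{key}"]},
            "CallerReference": str(time.time()),  # must be unique per request
        },
    )

# purge_image("E2EXAMPLE123", "images/42.jpg")  # hypothetical IDs
```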
So, due to the setup and the accesses (including possible accesses of files that should have been deleted), this could get quite costly quite quickly.
This is where MongoDB could win. Depending on your scenario, MongoDB could actually be cheaper here, because you could use a whole bunch of micro instances on AWS to hold your information, add spot instance reservations to those instances (dirt cheap), and all you need is a big disk on a single machine.
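As a rough sketch of what "MongoDB holding your information" could look like, PyMongo's GridFS stores the raw bytes directly in the database. The host and database names below are assumptions, not a recommendation.

```python
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017")  # hypothetical host
db = client["images"]
fs = gridfs.GridFS(db)

def save_image(image_id: str, data: bytes):
    # GridFS keeps metadata in images.fs.files and the bytes in
    # images.fs.chunks; a ~20 kB image fits in a single chunk.
    return fs.put(data, filename=f"{image_id}.jpg")

def load_image(image_id: str) -> bytes:
    grid_out = fs.find_one({"filename": f"{image_id}.jpg"})
    return grid_out.read()
```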
Hell, you could even use S3 to store the images and then MongoDB as a CloudFront replacement.
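A sketch of that combination might be a simple read-through cache: look in MongoDB first, fall back to S3 on a miss, and cache what you fetched. Everything named below (bucket, host, key layout) is hypothetical.

```python
import boto3
import gridfs
from pymongo import MongoClient

s3 = boto3.client("s3")
fs = gridfs.GridFS(MongoClient("mongodb://mongo-host:27017")["image_cache"])
BUCKET = "my-image-bucket"  # hypothetical bucket

def fetch_image(key: str) -> bytes:
    cached = fs.find_one({"filename": key})
    if cached is not None:
        return cached.read()                # cache hit: served from MongoDB
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    fs.put(body, filename=key)              # warm the cache for next time
    return body
```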
When you want to push images out to different regions, you just spin up a few spot instances in the target region and get MongoDB to replicate its data across. You can do some cool stuff with the replication, too, to make sure only frequently accessed files from that region are placed in that region.
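One way to do that with a replica set is to tag each member with its region (in the replica set configuration) and read with a tag-aware preference, so clients in a region hit their local copy. A sketch, assuming hypothetical hosts and a "region" tag:

```python
from pymongo import MongoClient
from pymongo.read_preferences import Nearest

client = MongoClient(
    "mongodb://eu-host:27017,us-host:27017",  # hypothetical members
    replicaSet="rs0",
)

# Prefer the nearest member tagged as living in eu-west; the empty {} at the
# end is a fallback so reads still work if no tagged member is reachable.
eu_images = client.get_database(
    "images",
    read_preference=Nearest(tag_sets=[{"region": "eu-west"}, {}]),
)

doc = eu_images.thumbnails.find_one({"image_id": 42})
```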
So I wouldn't throw MongoDB out (or even Cassandra); rather, I would do a means test between the two.
As an added note about S3 pricing: if you store your files with RR (Reduced Redundancy), the price roughly halves, which makes S3 very cheap; however, you still have the problem that S3 is not a CDN.
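For what it's worth, the storage class is just a flag on the upload. A quick sketch with boto3 (bucket and key are placeholders); note that Reduced Redundancy trades some durability for the lower price.

```python
import boto3

s3 = boto3.client("s3")

with open("42.jpg", "rb") as f:
    s3.put_object(
        Bucket="my-image-bucket",           # hypothetical bucket
        Key="images/42.jpg",
        Body=f.read(),
        ContentType="image/jpeg",
        StorageClass="REDUCED_REDUNDANCY",  # roughly half the price of STANDARD
    )
```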
Since I really only carried on from @cirrus' answer, I will actually re-evaluate your question, which is kind of answered above.
As an example, YouTube actually stores all of its images on single computers that are then distributed, so they can easily manage 200M thumbnails and... well... a lot of views each day straight from the file system. So I think your worry about the file system is overrated.
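If you do go the filesystem route, the usual trick is to shard files into nested directories derived from a hash of the ID, so no single directory ends up with millions of entries. A sketch, with a made-up layout and root path:

```python
import hashlib
from pathlib import Path

ROOT = Path("/data/images")  # hypothetical mount point

def path_for(image_id: str) -> Path:
    digest = hashlib.md5(image_id.encode()).hexdigest()
    # Two hash-derived directory levels, e.g. /data/images/a1/d0/42.jpg,
    # keep each directory to a manageable number of entries.
    return ROOT / digest[:2] / digest[2:4] / f"{image_id}.jpg"

def save(image_id: str, data: bytes) -> None:
    target = path_for(image_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
```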
As for which database is better...I dunno, that comes down to your testing.
I mean, the answer to your problem depends on your scenario, your budget, your hardware, and your resources; i.e., if you have AWS servers, this would be a whole different answer than with dedicated in-house servers.
Why don't you stick them in Amazon S3 or Azure Blob Storage? They're a much better fit: you won't have space or memory issues, and you won't have to manage the deployment.
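An S3 upload is sketched further up; the Azure Blob Storage side is similarly only a few lines, assuming the azure-storage-blob SDK. The connection string, container, and blob names below are placeholders.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("images")

# Upload one local image file as a blob, overwriting any existing copy.
with open("42.jpg", "rb") as f:
    container.upload_blob(name="images/42.jpg", data=f, overwrite=True)
```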