
MongoDB as file storage

I'm trying to find the best solution for scalable storage of big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes.

I have found some information about Hadoop and its HDFS, but it looks a bit complicated, because I don't need any Map/Reduce jobs or many of its other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.

And now the questions:

  1. What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
  2. Will files from GridFS be cached in RAM, and how will that affect read/write performance?
  3. Maybe there are some other solutions that can solve my problem more efficiently?

Thanks.

asked Feb 22 '13 by cmd

People also ask

Is MongoDB good for file storage?

Large objects, or "files", are easily stored in MongoDB. It is no problem to store 100MB videos in the database. This has a number of advantages over files stored in a file system. Unlike a file system, the database will have no problem dealing with millions of objects.

Can MongoDB be used as a file system?

GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, video files, etc. It is kind of a file system to store files but its data is stored within MongoDB collections. GridFS has the capability to store files even greater than its document size limit of 16MB.

Can I store PDF files in MongoDB?

MongoDB can store files like PDF, MS-Excel, Word, etc. either in the form of Binary or Stream.
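For instance, here is a minimal sketch of the Binary approach with pymongo (the database, collection, and file names are placeholders); keep in mind it only works for files under the 16MB document limit.

    from pymongo import MongoClient
    from bson.binary import Binary

    db = MongoClient("mongodb://localhost:27017")["storage_db"]  # placeholder names

    # Embed a small PDF directly in a regular document; anything over the
    # 16 MB document limit should go through GridFS instead.
    with open("report.pdf", "rb") as f:  # placeholder file
        db.documents.insert_one({"filename": "report.pdf",
                                 "data": Binary(f.read())})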

Can you store files in NoSQL?

NoSQL databases store data in documents rather than relational tables. Accordingly, we classify them as "not only SQL" and subdivide them by a variety of flexible data models. Types of NoSQL databases include pure document databases, key-value stores, wide-column databases, and graph databases.


2 Answers

I can only answer for MongoDB here; I won't pretend to know much about HDFS and other such technologies.

The GridFS implementation is entirely client-side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).

This means that querying for any part of the files or chunks collection results in the same process as any other query: the data it needs is loaded into your working set ( http://en.wikipedia.org/wiki/Working_set ), which represents the set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging the data into RAM (well, technically the OS does).

Another point to take into consideration is that this is driver-implemented. This means the specification can vary, though I don't think it does. All drivers will allow you to query a set of documents from the files collection, which houses only the file metadata, and later serve the file itself from the chunks collection with a single query.
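As a minimal sketch of that workflow with pymongo (the connection string, database name, and file names are placeholders), storing and retrieving a file through GridFS is just a series of ordinary queries against the fs.files and fs.chunks collections:

    from pymongo import MongoClient
    import gridfs

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    db = client["storage_db"]                          # placeholder database name
    fs = gridfs.GridFS(db)                             # wraps fs.files and fs.chunks

    # Write: the driver splits the data into chunk documents client-side.
    with open("video.bin", "rb") as f:                 # placeholder file
        file_id = fs.put(f, filename="video.bin")

    # Metadata only: an ordinary query against fs.files, no chunk data touched.
    meta = db.fs.files.find_one({"_id": file_id})
    print(meta["length"], meta["chunkSize"])

    # Read: the driver fetches the chunk documents and reassembles the file.
    with open("copy.bin", "wb") as out:
        out.write(fs.get(file_id).read())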

However, that is not the important part: you want to serve the file itself, including its data, which means you will be loading both the files collection and its corresponding chunks collection into your working set.

With that in mind we have already hit the first snag:

Will files from GridFS be cached in RAM, and how will that affect read/write performance?

The read performance of small files could be awesome, directly from RAM; the writes would be just as good.

For larger files, not so much. Most computers will not have 600 GB of RAM, yet it is quite normal to house a 600 GB portion of a single file on a single mongod instance. This creates a problem: that file, in order to be served, needs to fit into your working set, but it is far bigger than your RAM. At this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is page faulting 24/7 trying to load the file. Writes are no better.

The only way around this is to start putting a single file across many shards :\.
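As a rough sketch of what that looks like (assuming a sharded cluster reached through mongos, with pymongo; the host and database names are placeholders), you shard the chunks collection on { files_id: 1, n: 1 } so the balancer can split the pieces of a single file across shards; the files collection holds only metadata and is usually small enough to leave unsharded:

    from pymongo import MongoClient

    # Connect to the mongos router of a sharded cluster (placeholder address).
    client = MongoClient("mongodb://mongos-host:27017")

    # Enable sharding for the database and shard the GridFS chunks collection.
    client.admin.command("enableSharding", "storage_db")
    client.admin.command(
        "shardCollection",
        "storage_db.fs.chunks",
        key={"files_id": 1, "n": 1},  # compound key lets chunks of one file split across shards
    )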

Note: one more thing to consider is that the default size of a GridFS chunk is 256 KB, so that's a lot of documents for a 600 GB file. This setting is configurable in most drivers.
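For example (a sketch using pymongo's GridFSBucket API; the 1 MB value is purely illustrative, not a recommendation), the chunk size can be set when the bucket is created:

    from pymongo import MongoClient
    from gridfs import GridFSBucket

    db = MongoClient("mongodb://localhost:27017")["storage_db"]  # placeholder

    # Larger chunks mean fewer chunk documents per file (illustrative value only).
    bucket = GridFSBucket(db, chunk_size_bytes=1024 * 1024)

    with open("video.bin", "rb") as f:  # placeholder file
        file_id = bucket.upload_from_stream("video.bin", f)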

What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)

GridFS, being only a specification, uses the same locks as any other collection: read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well; after all, how can you ensure a consistent read of a document that is being written to?

That being said, the possibility of contention depends on your scenario specifics: traffic, the number of concurrent writes/reads, and many other things we have no idea about.

Maybe there are some other solutions that can solve my problem more efficiently?

I have personally found that S3 (as @mluggy said) with reduced redundancy storage works best, storing only a small amount of metadata about the file within MongoDB, much like using GridFS but without the chunks collection, and letting S3 handle all of that distribution, backup and other stuff for you.
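A minimal sketch of that approach (assuming boto3 and pymongo; the bucket, key, database, and file names are placeholders):

    import boto3
    from pymongo import MongoClient

    s3 = boto3.client("s3")
    db = MongoClient("mongodb://localhost:27017")["storage_db"]  # placeholder

    # Upload the actual bytes to S3 using the reduced redundancy storage class.
    s3.upload_file(
        "video.bin", "my-file-bucket", "files/video.bin",        # placeholders
        ExtraArgs={"StorageClass": "REDUCED_REDUNDANCY"},
    )

    # Keep only metadata in MongoDB, pointing at the S3 object.
    db.files.insert_one({
        "filename": "video.bin",
        "bucket": "my-file-bucket",
        "key": "files/video.bin",
        "contentType": "application/octet-stream",
    })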

Hopefully I have been clear; hope it helps.

Edit: Contrary to what I accidentally said, MongoDB does not have a collection-level lock; it is a database-level lock.

answered Sep 23 '22 by Sammaye


Have you considered saving the metadata to MongoDB and writing the actual files to Amazon S3? Both have excellent drivers, and the latter is highly redundant, cloud/CDN-ready file storage. I would give it a shot.

answered Sep 19 '22 by mluggy