I'm documenting about the GridFS and the possibility to shard it among different machines.
Reading the documentation here, the suggested shard key is chunks.files_id. This key will be linked to the _id of the files collection, thus this _id is incremental. Every new file i save in the Grid will have a new incremental _id.
In the O'Reilly "Scaling MongoDB" book the use of an incremental shard key is discouraged to avoid HotSpots (the last shard will receive all the write and read).
what is your suggestion for sharding the GridFS collection?
have anybody experienced the HotSpot problem?
thank you.
GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, video files, etc. It is kind of a file system to store files but its data is stored within MongoDB collections. GridFS has the capability to store files even greater than its document size limit of 16MB.
Sharding is the process of distributing data across multiple hosts. In MongoDB, sharding is achieved by splitting large data sets into small data sets across multiple MongoDB instances.
In MongoDB, use GridFS for storing files larger than 16 MB. In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem. If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
You create a GridFSBucket instance by calling its constructor: IMongoDatabase database; var bucket = new GridFSBucket(database);
You should shard on files_id
to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _id
s in the fs.files collection (probably MD5s would be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With