I have around 2 million strings of varying lengths that I need to compress and store in MongoDB GridFS as files.
The strings are currently stored in a TEXT column of an MS SQL Server table. I wrote a sample app that reads each row, compresses it, and stores it as a GridFS file.
There is one reader and a thread pool of 50 threads storing the results. It works, but it is very slow (about 100 records per second on average).
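For illustration, the single-reader/worker-pool pattern described above can be sketched as follows. This is a minimal stand-in, not the actual app: the row source is faked, `store_as_gridfs_file` is a hypothetical placeholder for the GridFS upload, and the worker count is reduced from 50.

```python
import queue
import threading
import zlib

NUM_WORKERS = 4  # the original app used 50

stored = {}  # stand-in for GridFS; maps filename -> compressed bytes

def store_as_gridfs_file(name, data):
    # Hypothetical placeholder: a real app would write `data`
    # to GridFS here instead of a dict.
    stored[name] = data

# Bounded queue so the reader cannot race far ahead of the workers.
work = queue.Queue(maxsize=1000)

def worker():
    while True:
        item = work.get()
        if item is None:  # sentinel: no more rows
            work.task_done()
            break
        name, text = item
        store_as_gridfs_file(name, zlib.compress(text.encode("utf-8")))
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# The single reader: a fake result set instead of the SQL table.
for i in range(100):
    work.put((f"doc-{i}", f"row {i} text" * 10))

# One sentinel per worker, then wait for everything to drain.
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
```

With a pattern like this, throughput is usually limited by whatever `store_as_gridfs_file` does per item, which is why profiling the store path (as in the answer below) pays off.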
I was wondering if there is any way for faster import into GridFS?
I'm using MongoDB 1.6 on Windows with the MongoCSharp driver in C# and .NET.
I think I found the issue in the MongoDB C# driver by profiling it while running a very simple app that stores 1,000 strings as 1,000 GridFS files.
It turns out that 97% of the time was spent checking whether a file with the same filename already exists in the collection. I added an index on the filename field and it's now blazing fast!
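For anyone hitting the same wall, the index can be created from the mongo shell (assuming the default GridFS root collection `fs`; this runs against a live server, so it is shown here only as the command, not as runnable sample code):

```javascript
db.fs.files.ensureIndex({ filename: 1 });
```

Without this index, every filename-existence check is a full collection scan of `fs.files`, which degrades as the import grows.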
My remaining question: if the driver needs to keep filenames unique and performs that check, why doesn't it create a unique index on the field when one is missing? What's the reasoning behind that?