I have a MongoDB 3.6.2.0 collection with 130M documents. It has several simple fields and 2 fields with nested JSON documents. The data is stored in compressed format (zlib).
I need to export one of the embedded fields to JSON as quickly as possible. However, mongoexport is taking forever: after 12 hours it has processed only 5.5% of the data, which is too slow for me.
The CPU is not busy; mongoexport appears to be single-threaded.
Export command I am using:
mongoexport -c places --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords.json
Under the hood, it is actually the getMore command that is unreasonably slow:
2018-05-02T17:59:35.605-0700 I COMMAND [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms
I have tried running multiple commands with the --skip and --limit options in separate processes, like this:
mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords.json
mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords.json
etc. But I was not able to wait long enough for the command with the first non-zero --skip to even start!
I have also tried the --forceTableScan option, which did not make any difference.
I have no indexes on the places collection.
My storage configuration:
journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib
Collection stats:
'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0
My server specs:
Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon 2.2 GHz
I've run a test, and with an SSD the read throughput is just as terrible as with the HDD.
My question:
Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?
I moved the DB to fast NVMe SSD drives, and I think I can now state my concerns about MongoDB read performance more clearly.
Why does this command, which looks for a chunk of documents that do not have a specific field:
2018-05-05T07:20:46.215+0000 I COMMAND [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms
only put 50Mb/sec of read pressure on a fast flash drive? This is clearly the performance of single-threaded random (scattered) reads, whereas I have just proven that the drive easily allows 1Gb/sec read/write throughput.
In terms of Mongo internals, would it not be wiser to read the BSON file sequentially, instead of iterating document after document, and gain a 20x improvement in scanning speed? (And, since my blocks are zlib compressed and the server has 16 cores, to decode the fetched chunks in one or several helper threads?)
I can also confirm that even when I specify no query filters and clearly want to iterate the ENTIRE collection, a fast sequential read of the BSON file is not happening.
There are many factors limiting the export performance.
The read preference is secondaryPreferred, meaning that it will try to read from a secondary. If the replica set is being actively written to, oplog apply operations on the secondary will block readers, which adds further delay.
One possible improvement: if this is an operation you perform frequently, creating an index on the relevant fields and exporting with a covered query could improve performance, since the index would be smaller than the full documents.
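For illustration only, a minimal pymongo sketch of creating such an index; the connection string is a placeholder, and whether indexing the whole embedded API field is practical depends on its size:
from pymongo import MongoClient, ASCENDING

# Placeholder connection details; adjust to your deployment.
client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
places = client["maps"]["places"]

# Include _id alongside API so the projection { _id: 1, API: 1 } produced by
# mongoexport --fields API could, in principle, be served from the index
# alone (a covered query) instead of fetching full documents.
places.create_index([("API", ASCENDING), ("_id", ASCENDING)], name="API_1__id_1")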
Edit: Running mongoexport in parallel may be helpful in this case.
Following the additional information provided, I ran a test that seems to alleviate this issue somewhat.
It seems that running mongoexport in parallel, with each mongoexport process handling a subset of the collection, can speed up the export.
To do this, divide the _id namespace according to the number of mongoexport processes you plan to run.
For example, if I have 200,000 documents, from _id:0 to _id:199,999, and use 2 mongoexport processes:
mongoexport -q '{"_id":{"$gte":0, "$lt":100000}}' -d test -c test > out1.json &
mongoexport -q '{"_id":{"$gte":100000, "$lt":200000}}' -d test -c test > out2.json &
where, in the above example, the two mongoexport processes each handle one half of the collection.
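If you would rather not launch the processes by hand, the same split can be scripted; the following is only a sketch (Python, assuming numeric _id values, a placeholder URI, and illustrative output paths, not part of the tested setup below):
import subprocess

TOTAL_DOCS = 200000     # assumed upper bound of the numeric _id range
WORKERS = 4             # number of parallel mongoexport processes
STEP = TOTAL_DOCS // WORKERS

procs = []
for i in range(WORKERS):
    lo, hi = i * STEP, (i + 1) * STEP
    query = '{"_id": {"$gte": %d, "$lt": %d}}' % (lo, hi)
    # Each process exports one _id range to its own output file.
    procs.append(subprocess.Popen([
        "mongoexport",
        "--uri", "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin",
        "-c", "places",
        "-q", query,
        "--fields", "API",
        "-o", "out%d.json" % i,
    ]))

# Wait for every export to finish before merging the output files.
for p in procs:
    p.wait()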
Testing this workflow with 1 process, 2 processes, 4 processes, and 8 processes I arrive at the following timings:
Using 1 process:
real 0m32.720s
user 0m33.900s
sys 0m0.540s
2 processes:
real 0m16.528s
user 0m17.068s
sys 0m0.300s
4 processes:
real 0m8.441s
user 0m8.644s
sys 0m0.140s
8 processes:
real 0m5.069s
user 0m4.520s
sys 0m0.364s
Depending on the available resources, running 8 mongoexport processes in parallel seems to speed up the process by a factor of ~6. This was tested on a machine with 8 cores.
Note: halfer's answer is similar in idea, although this answer basically tries to see if there's any benefit in calling mongoexport in parallel.
You can try using the pandas and joblib libraries to export to JSON files in parts. You can refer to this gist for processing the data in MongoDB:
from pandas import DataFrame
from joblib import Parallel, delayed

def process(idx, cursor):
    # Dump the documents from one cursor into its own JSON file.
    file_name = "fileName" + str(idx) + ".json"
    df = DataFrame(list(cursor))
    df.to_json(file_name, orient='records')

# Make a list of cursors; see the parallel_scan API of pymongo.
# mongo_collection is a pymongo Collection object for maps.places.
no_of_parts_of_collection = 4
cursors = mongo_collection.parallel_scan(no_of_parts_of_collection)
Parallel(n_jobs=4)(delayed(process)(idx, cursor) for idx, cursor in enumerate(cursors))
The n_jobs parameter spawns as many processes as the number specified; each process occupies one core. I used 4 since your server has 6 cores available. The parallel_scan() API takes a number and divides the collection into that many parts. You can try higher numbers to break the collection into more evenly divided cursors.
I have tried a similar approach, though the signature and definition of my process function were different. I was able to process 2.5M records in under 20 minutes. You can read this answer of mine to get an idea of what exactly I was trying to achieve.
I don't work with Mongo, but a common trick could be used: make a simple app that efficiently and sequentially queries all the data, filters it, and saves it in the format you want.
If you need to save to a complex format and there are no libraries to work with it (which I really doubt), it might still be efficient to read everything, filter it, put the results back into a temporary collection, export that collection in full, and then drop the temporary collection.
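A minimal sketch of that temporary-collection route, assuming pymongo; the filter, temporary collection name, and URI below are placeholders rather than part of the original suggestion:
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
db = client["maps"]

# 1. Write only the documents/fields you need into a temporary collection.
db["places"].aggregate([
    {"$match": {"API.url": {"$exists": True}}},   # illustrative filter
    {"$project": {"API": 1}},
    {"$out": "places_api_tmp"},
], allowDiskUse=True)

# 2. Export the (much smaller) temporary collection, e.g.:
#    mongoexport --uri "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin" \
#        -c places_api_tmp -o D:\APIRecords.json

# 3. Drop the temporary collection when done.
db["places_api_tmp"].drop()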