Is it possible to improve Mongoexport speed?

I have a MongoDB 3.6.2.0 collection with 130 million documents. It has several simple fields and 2 fields with nested JSON documents. Data is stored in compressed format (zlib).

I need to export one of the embedded fields to JSON as quickly as possible. However, mongoexport is taking forever: after 12 hours of running it has processed only 5.5% of the data, which is too slow for me.

The CPU is not busy. Mongoexport seems to be single-threaded.

Export command I am using:

mongoexport -c places --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json

Under the hood, it is the getMore command that is unreasonably slow:

2018-05-02T17:59:35.605-0700 I COMMAND  [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms

I have tried running multiple commands with --skip and --limit options in separate processes, like this:

mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords1.json
mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords2.json

etc. But I was not able to finish waiting till the command with first non-zero SKIP even starts!

I have also tried the --forceTableScan option, which made no difference.

There are no indexes on the places collection (apart from the default _id index).

My storage configuration:

journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib

Collection stats:

'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0

My server specs:

Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon 2.2 GHz

I ran a test, and with an SSD the read throughput is just as terrible as with the HDD.

My question:

Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?

Update

I moved the DB to fast NVMe SSD drives, so I think I can now state my concerns about MongoDB read performance more clearly.

Why does this command, which looks for a chunk of documents that do not have a specific field:

2018-05-05T07:20:46.215+0000 I COMMAND  [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms

only generate about 50 MB/s of read pressure on a fast flash drive? This is clearly the performance of a single-threaded random (scattered) read, whereas I have just verified that the drive easily sustains 1 GB/s of sequential read/write throughput.

In terms of Mongo internals, would it not be wiser to read the BSON files sequentially and gain a 20x improvement in scanning speed, instead of iterating document after document? (And since my blocks are zlib-compressed and the server has 16 cores, why not decode the fetched chunks in one or several helper threads?)

I can also confirm that even when I specify no query filter and clearly want to iterate over the ENTIRE collection, no fast sequential read of the BSON files takes place.

Asked by Anatoly Alekseev on May 03 '18


3 Answers

There are many factors that are limiting the export performance.

  • The data size is relatively large compared to available memory: ~2 TB vs. ~5 GB WiredTiger cache (if set to default). That is:
    • The whole WiredTiger cache can contain at best ~0.22% of the collection; in reality it is very likely much less, since the cache also holds data from other collections and indexes.
    • This means that WiredTiger needs to fetch from disk very frequently while evicting the current content of the cache. If the replica set is being actively used, this means evicting "dirty" data from the cache and persisting it to disk, which takes time.
    • Note that documents inside the WiredTiger cache are not compressed. (A way to check the cache size actually in effect on your deployment is sketched after this list.)
  • The collection contains large documents, of which you need only one part. This means extra time is required to process each document.
  • The collection is compressed with zlib, meaning that extra time must be used to uncompress the documents.
  • The readPreference is secondaryPreferred, meaning that it will try to read from a secondary. If the replica set is being actively written to, oplog apply operations on the secondary will block readers. This will add further delay.
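
For reference, a quick way to check the WiredTiger cache size actually in effect is to read serverStatus. A minimal pymongo sketch (the connection string is a placeholder, not taken from the question):

from pymongo import MongoClient

# Placeholder connection string; point it at your own deployment.
client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

# Standard WiredTiger cache statistics reported by serverStatus.
configured_gib = cache["maximum bytes configured"] / 1024 ** 3
in_use_gib = cache["bytes currently in the cache"] / 1024 ** 3
print("cache configured: %.1f GiB, in use: %.1f GiB" % (configured_gib, in_use_gib))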

One possible improvement: if this is an export you run frequently, creating an index on the relevant fields and exporting using a covered query could improve performance, since the index would be smaller than the full documents.
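
A minimal pymongo sketch of that idea (the compound index is an assumption based on the projection { _id: 1, API: 1 } visible in the question's logs, not a tested recipe):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://user:pass@hostIP:hostPort/maps?authSource=admin")
places = client["maps"]["places"]

# Compound index matching the export projection { _id: 1, API: 1 }.
# Caveats: building it on 130M documents takes time and disk space, MongoDB 3.6
# limits index keys to 1024 bytes (so the indexed API values must be small),
# and multikey (array) indexes cannot cover queries.
places.create_index([("API", ASCENDING), ("_id", ASCENDING)], background=True)

After the index is built, re-running the same mongoexport with --fields API may let the planner serve the projection from the index instead of fetching full documents; whether the query is actually covered should be verified with explain().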

Edit: Running mongoexport in parallel may be helpful in this case:

Following the additional information provided, I ran a test that seems to alleviate this issue somewhat.

It seems that running mongoexport in parallel, with each mongoexport process handling a subset of the collection, can speed up the export.

To do this, divide the _id space according to the number of mongoexport processes you plan to run.

For example, if I have 200,000 documents with _id ranging from 0 to 199,999 and I use 2 mongoexport processes:

mongoexport -q '{"_id":{"$gte":0, "$lt":100000}}' -d test -c test > out1.json &
mongoexport -q '{"_id":{"$gte":100000, "$lt":200000}}' -d test -c test > out2.json &

In the above example, the two mongoexport processes each handle one half of the collection.
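
If the _id values are ObjectIds rather than small sequential integers (as they presumably are in the original collection), the same idea still works. Below is a hedged sketch that derives the range boundaries with pymongo and launches one mongoexport per range; the worker count, output file names, and the extended-JSON ({"$oid": ...}) form of the --query boundaries are assumptions, not taken from the question.

import json
import subprocess
from pymongo import MongoClient

N_WORKERS = 4  # number of parallel mongoexport processes (assumption)
URI = "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin"

client = MongoClient(URI)
coll = client["maps"]["places"]

total = coll.estimated_document_count()  # pymongo >= 3.7; coll.count() on older versions
chunk = total // N_WORKERS

# Walk the default _id index to find the boundary _ids between chunks.
boundaries = []
for i in range(1, N_WORKERS):
    doc = next(coll.find({}, {"_id": 1}).sort("_id", 1).skip(i * chunk).limit(1))
    boundaries.append(doc["_id"])

ranges, lower = [], None
for upper in boundaries + [None]:
    ranges.append((lower, upper))
    lower = upper

procs = []
for idx, (lo, hi) in enumerate(ranges):
    query = {"_id": {}}
    if lo is not None:
        query["_id"]["$gte"] = {"$oid": str(lo)}  # assumes _id is an ObjectId
    if hi is not None:
        query["_id"]["$lt"] = {"$oid": str(hi)}
    if not query["_id"]:
        query = {}
    cmd = ["mongoexport", "--uri", URI, "-c", "places",
           "--fields", "API", "-q", json.dumps(query),
           "-o", "APIRecords_%d.json" % idx]  # hypothetical output names
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()

Each worker then performs an index-bounded scan over its own _id range instead of skipping past documents it does not need, which is what made the --skip approach so slow.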

Testing this workflow with 1 process, 2 processes, 4 processes, and 8 processes, I arrive at the following timings:

Using 1 process:

real    0m32.720s
user    0m33.900s
sys 0m0.540s

2 processes:

real    0m16.528s
user    0m17.068s
sys 0m0.300s

4 processes:

real    0m8.441s
user    0m8.644s
sys 0m0.140s

8 processes:

real    0m5.069s
user    0m4.520s
sys 0m0.364s

Depending on the available resources, running 8 mongoexport processes in parallel seems to speed up the export by a factor of ~6. This was tested on a machine with 8 cores.

Note: halfer's answer is similar in idea, although this answer basically tries to see if there's any benefit in calling mongoexport in parallel.

Answered by kevinadi


You can try using the pandas and joblib libraries to export to the JSON file in parts. The snippet below (adapted from a gist) processes the data in MongoDB with parallel cursors:

from pandas import DataFrame
from joblib import Parallel, delayed
from pymongo import MongoClient

def process(idx, cursor):
    file_name = "fileName" + str(idx) + ".json"
    df = DataFrame(list(cursor))
    df.to_json(file_name, orient='records')

# Make a list of cursors; see the parallel_scan API of pymongo
# (available in pymongo 3.x against MongoDB <= 4.0; note that against
# WiredTiger the server may return only a single cursor).
mongo_collection = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")["maps"]["places"]
no_of_parts_of_collection = 4  # how many cursors to split the collection into

cursors = mongo_collection.parallel_scan(no_of_parts_of_collection)
Parallel(n_jobs=4)(delayed(process)(idx, cursor) for idx, cursor in enumerate(cursors))

The n_jobs parameter spawns as many processes as the number specified, each occupying one core. I used 4 since your server has 6 cores available. The parallel_scan() API takes a number and divides the collection into up to that many parts. You can try higher numbers to break the collection into more evenly sized cursors.

I have tried a similar approach, although the signature and definition of my process function were different, and I was able to process 2.5M records in under 20 minutes. You can read another answer of mine to get an idea of what exactly I was trying to achieve.

Answered by Yayati Sule


I don't work with Mongo, but a common trick could be used: make a simple app that efficiently and sequentially queries all the data, filters it, and saves it in the format you want.

If you need to save to a complex format and there are no libraries for it (which I really doubt), it might still be efficient to read everything, filter it, put the result back into a temporary collection, export that collection in full, and then drop the temporary collection.
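
As one possible illustration of the temporary-collection idea, here is a minimal pymongo sketch that does the projection server-side with an aggregation $out stage (the collection name places_api_tmp and the projected API field are assumptions based on the question):

from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@hostIP:hostPort/maps?authSource=admin")
db = client["maps"]

# Project only the needed field and write the result server-side into a
# (hypothetical) temporary collection; $out must be the last stage.
db["places"].aggregate(
    [
        {"$project": {"API": 1}},
        {"$out": "places_api_tmp"},
    ],
    allowDiskUse=True,
)

# The temporary collection can then be exported in full and dropped, e.g.:
#   mongoexport --uri ... -c places_api_tmp -o D:\APIRecords.json
#   db["places_api_tmp"].drop()

The temporary collection contains only the projected field, so exporting it in full should touch far less data than scanning the original full-size documents.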

Answered by Imaskar