
inserting millions of documents - mongo / pymongo - insert_many

New to mongo/pymongo. Currently using the latest - v3.2.2

It looks as if insert_many is not performing as intended. I've noticed that even when supplying a generator to db.col.insert_many, memory usage still spikes, which makes inserting millions of documents difficult. (I do realize that system memory should be larger than the collection size for best performance, so perhaps in reality this is nothing I should worry about?)

I was under the impression that if you pass a generator to insert_many, pymongo will 'buffer' the insert into 16 or 32 MB 'chunks'?

Performing this buffering/chunking manually solves the issue...

See below:

Example1 = straight insert_many (high memory usage - 2.625 GB)

Example2 = 'buffered' insert_many (expected [low] memory usage - ~300 MB)

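from itertools import chain, islice
import pymongo

client = pymongo.MongoClient()
db = client['test']

def generate_kv(N):
    # Lazily yield N small documents.
    for i in range(N):
        yield {'x': i}

print("example 1")
db.testcol.drop()
# Pass the whole generator to a single insert_many call.
db.testcol.insert_many(generate_kv(5000000))

def chunks(iterable, size=10000):
    # Yield lazy sub-iterators of at most `size` items each.
    # Each chunk must be fully consumed before the next one is requested.
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

print("example 2")
db.testcol.drop()
# One insert_many call per chunk of 10000 documents.
for c in chunks(generate_kv(5000000)):
    db.testcol.insert_many(c)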

Any ideas? Bug? Am I using this wrong?

asked May 18 '16 08:05 by Bryan




1 Answer

I think this happens because insert_many needs a complete list of operations, not an iterable. PyMongo sends that list to MongoDB only after it has been built, and processing starts only after that.
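To illustrate why building a complete list first costs memory, here is a small sketch that does not touch PyMongo at all. It assumes Python 3 (for tracemalloc), and list(...) is only a stand-in for whatever builds the full operation list; materialize_all, consume_in_batches and peak_bytes are hypothetical helper names, not part of any library.

```python
import tracemalloc
from itertools import islice


def generate_kv(n):
    """Lazily yield n small documents, as in the question."""
    for i in range(n):
        yield {'x': i}


def materialize_all(n):
    """Build the full list first (stand-in for a complete operation list)."""
    ops = list(generate_kv(n))
    return len(ops)


def consume_in_batches(n, size=10000):
    """Hold at most about two batches of documents at any moment."""
    iterator = generate_kv(n)
    total = 0
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return total
        total += len(batch)


def peak_bytes(func, *args):
    """Return the peak traced allocation, in bytes, while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    n = 200000
    print("full list:", peak_bytes(materialize_all, n), "bytes")
    print("batched:  ", peak_bytes(consume_in_batches, n), "bytes")
```

The exact numbers depend on the Python version, but the full-list peak should grow with the number of documents while the batched peak stays roughly tied to the batch size.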

  • If you want or need to use an iterable (e.g. documents that take a long time to generate), you can use a simple insert, one document at a time.
  • If you have a large number of documents that fit in RAM, you can send one bulk insert (insert_many).
  • Otherwise, split the data into the biggest chunks you can and send those to MongoDB.

This is normal behavior for databases.
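A minimal sketch of the "split into chunks" option, assuming only that the collection object has an insert_many method that accepts a list. The names batched and insert_in_batches are hypothetical helpers, and the default of 10000 is just the chunk size from the question, not a recommended value.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, documents, size=10000):
    """Call collection.insert_many once per batch.

    Returns the number of documents handed to insert_many.
    """
    sent = 0
    for batch in batched(documents, size):
        collection.insert_many(batch)
        sent += len(batch)
    return sent
```

With the names from the question, this would be called as insert_in_batches(db.testcol, generate_kv(5000000)). Unlike the lazy chunks in the question, each batch here is a real list, so only one batch of documents is built per call.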

answered Sep 27 '22 19:09 by wowkin2