Mongoengine is very slow on large documents compared to native pymongo usage

Tags: python, mongodb, pymongo, mongoengine

I have the following mongoengine model:

```
from mongoengine import Document, DateTimeField, DictField


class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)
```

In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).

I have encountered two (possibly related) issues:

1. When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
2. When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following:
```
m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)
```

The data in the object does not contain any references to other objects, so this is not an issue of object dereferencing.
I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects both document object construction and field access. Is there anything I can do to improve this?

asked Feb 07 '16 18:02

Baruch Oxman

1 Answer

TL;DR: mongoengine is spending ages converting all the returned arrays to dicts

To test this, I built a collection containing one document whose DictField holds a large nested dict, roughly in your 5-10MB range.

timeit.timeit then confirms the difference in read times between pymongo and mongoengine.

Finally, pycallgraph and GraphViz show what is taking mongoengine so long.

Here is the code in full:

```
import datetime
import itertools
import random
import timeit
from collections import defaultdict

import mongoengine as db
from pycallgraph.output.graphviz import GraphvizOutput
from pycallgraph.pycallgraph import PyCallGraph

db.connect("test-dicts")


class MyModel(db.Document):
    date = db.DateTimeField(required=True, default=datetime.date.today)
    data_dict_1 = db.DictField(required=False)


# Start from an empty collection so both reads return the document built below.
MyModel.drop_collection()

data_1 = ['foo', 'bar']
data_2 = ['spam', 'eggs', 'ham']
data_3 = ["subf{}".format(f) for f in range(5)]

m = MyModel()
tree = lambda: defaultdict(tree)  # http://stackoverflow.com/a/19189366/3271558
data = tree()
# 2 * 3 * 5 = 30 leaves, each a list of 20,000 random integers.
for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
    data[_d1][_d2][_d3] = random.sample(range(50000), 20000)
m.data_dict_1 = data
m.save()


def pymongo_doc():
    # Raw read through the underlying pymongo client.
    return db.connection.get_connection()["test-dicts"]['my_model'].find_one()


def mongoengine_doc():
    # Same read through mongoengine, which builds a MyModel instance.
    return MyModel.objects.first()


if __name__ == '__main__':
    # timeit returns the total time in seconds for all 10 calls.
    print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
    print("mongoengine took", timeit.timeit(mongoengine_doc, number=10))
    # Record a call graph of a single mongoengine read.
    with PyCallGraph(output=GraphvizOutput()):
        mongoengine_doc()
```
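
If pycallgraph or GraphViz is not installed, the standard library's cProfile gives a text version of the same information. This is only a minimal sketch: `profile_call` and `busy_work` are hypothetical names introduced here, and in the benchmark above you would pass `mongoengine_doc` instead of `busy_work`.

```python
import cProfile
import io
import pstats


def profile_call(func, top=10):
    """Run func() under cProfile and return (result, report text).

    The report lists up to `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def busy_work():
    # Stand-in for mongoengine_doc(); any zero-argument callable works.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    _, report = profile_call(busy_work)
    print(report)
```

Sorting by cumulative time puts the functions that dominate the read at the top of the report, which is the same question the call graph answers.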

The output shows that mongoengine is much slower than pymongo (both figures are total seconds for 10 reads):

```
pymongo took 0.87s
mongoengine took 25.81118331072267
```

The resulting call graph illustrates pretty clearly where the bottleneck is:

[Image: pycallgraph.png for mongoengine read of large doc (https://i.stack.imgur.com/qAb0t.png)]
[Image: hot spot in pycallgraph (https://i.stack.imgur.com/5q8Px.png)]

Essentially, mongoengine calls the to_python method on every DictField that it gets back from the db. to_python is pretty slow, and in our example it is being called a huge number of times.
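
To get a feel for the numbers, here is a toy model only, not mongoengine's actual code. The hypothetical `convert` function visits every nested value once, as any recursive per-value conversion has to, and counts the visits for a document shaped like the one in the benchmark.

```python
import itertools


def convert(value, counter):
    """Recursively copy value, counting one 'conversion' per node visited."""
    counter["calls"] += 1
    if isinstance(value, dict):
        return {k: convert(v, counter) for k, v in value.items()}
    if isinstance(value, list):
        return [convert(v, counter) for v in value]
    return value


def build_doc(list_len=20000):
    """Build a 2 x 3 x 5 nested dict of integer lists, like the benchmark."""
    data_1 = ["foo", "bar"]
    data_2 = ["spam", "eggs", "ham"]
    data_3 = ["subf{}".format(i) for i in range(5)]
    doc = {}
    for d1, d2, d3 in itertools.product(data_1, data_2, data_3):
        doc.setdefault(d1, {}).setdefault(d2, {})[d3] = list(range(list_len))
    return doc


if __name__ == "__main__":
    counter = {"calls": 0}
    convert(build_doc(), counter)
    print(counter["calls"])
```

With 30 lists of 20,000 integers, the counter ends at 600,039 (1 + 2 + 6 + 30 + 600,000), so even a cheap per-value step in pure Python adds up quickly.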

Mongoengine is used to elegantly map your document structure to Python objects. If you have very large unstructured documents (which MongoDB is great for), then mongoengine isn't really the right tool and you should just use pymongo.
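
If you do drop down to pymongo, a projection lets the server return only the sub-document you need, which should also cut transfer and decoding time. This is a minimal sketch, assuming `collection` behaves like a pymongo `Collection` (for example, the one returned by `MyModel._get_collection()`); `fetch_subtree` is a hypothetical helper name, and keys that themselves contain dots are not handled.

```python
def fetch_subtree(collection, dotted_path, query=None):
    """Fetch only the value at dotted_path from the first matching document.

    `collection` is expected to offer find_one(filter, projection), as a
    pymongo Collection does. Returns None when nothing matches or the path
    is absent.
    """
    projection = {dotted_path: 1, "_id": 0}
    doc = collection.find_one(query or {}, projection)
    for key in dotted_path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


# Usage (needs a running MongoDB, so it is left as a comment):
# values = fetch_subtree(MyModel._get_collection(), "data_dict_1.foo.spam.subf0")
```

The result is plain Python data (dicts, lists, numbers), so no mongoengine conversion is involved at all.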

However, if you know the structure, you can use EmbeddedDocument fields to get better performance from mongoengine. I ran similar, but not equivalent, test code in this gist, and the output is:

```
pymongo with dict took 0.12s
pymongo with embed took 0.12s
mongoengine with dict took 4.3059175412661075
mongoengine with embed took 1.1639373211854682
```

So you can make mongoengine faster, but pymongo is much faster still.

UPDATE

A good shortcut to the pymongo interface here is to use the aggregation framework:

```
def mongoengine_agg_doc():
    # aggregate() yields raw dicts, so no MyModel instance is constructed.
    return list(MyModel.objects.aggregate({"$limit": 1}))[0]
```

answered Oct 22 '22 08:10

Steve Rossiter
