Avoiding multiple references to the same object in Django ORM

Tags:

python

orm

django

We have an app with highly interrelated data, i.e. there are many cases where two objects might refer to the same object via a relationship. As far as I can tell, Django does not make any attempt to return a reference to an already-fetched object if you attempt to fetch it via a different, previously unevaluated relationship.

For example:

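from django.db.models import CASCADE, CharField, ForeignKey, Model

class Customer( Model ):
    firstName = CharField( max_length = 64 )
    lastName = CharField( max_length = 64 )

class Order( Model ):
    # on_delete is required from Django 2.0 on; CASCADE matches the older default
    customer = ForeignKey( Customer, on_delete = CASCADE, related_name = "orders" )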

Then assume we have a single customer who has two orders in the DB:

order1, order2 = Order.objects.all()
print( order1.customer ) # (1) One DB fetch here
print( order2.customer ) # (2) Another DB fetch here
print( order1.customer == order2.customer ) # (3) True, because PKs match
print( id( order1.customer ) == id( order2.customer ) ) # (4) False, not the same object

When you have highly interrelated data, accessing relationships on your objects results in more and more repeated DB queries for the same data, and this becomes a problem.

We also program for iOS and one of the nice things about CoreData is that it maintains context, so that in a given context there is only ever one instance of a given model. In the example given above, CoreData would not have done the second fetch at (2), because it would have resolved the relationship using the customer already in memory.

Even if line (2) were replaced with a spurious example designed to force another DB fetch (like print( Order.objects.exclude( pk = order1.pk ).get( customer = order1.customer ) )), CoreData would realize that the result of that second fetch resolved to a model already in memory and return the existing model instead of a new one (i.e. (4) would print True in CoreData because they would actually be the same object).

To hedge against this behaviour of Django, we are kinda writing all this horrible stuff to try to cache models in memory by their (type, pk) and then check relationships with the _id suffix to try to pull them from the cache before blindly hitting the DB with another fetch. This is cutting down on DB throughput but feels really brittle and likely to cause problems if normal relationship lookups via properties accidentally happen in some contrib framework or middleware that we don't control.
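The (type, pk) cache described above is essentially an identity map. A minimal sketch of the idea in plain Python, independent of Django, might look like the following. The names IdentityMap, get and load_customer are hypothetical, and the loader callback only stands in for whatever would actually hit the DB.

```python
class IdentityMap:
    """Keeps at most one in-memory instance per (type, pk)."""

    def __init__(self):
        self._instances = {}

    def get(self, model_type, pk, loader):
        """Return the cached instance, calling loader(pk) only on a miss."""
        key = (model_type, pk)
        if key not in self._instances:
            self._instances[key] = loader(pk)
        return self._instances[key]


class Customer:
    # Stand-in for a model class; this is NOT a Django model.
    def __init__(self, pk):
        self.pk = pk


fetches = []  # records every simulated DB hit


def load_customer(pk):
    # Hypothetical loader; a real one would query the database.
    fetches.append(pk)
    return Customer(pk)


identity_map = IdentityMap()
first = identity_map.get(Customer, 1, load_customer)
second = identity_map.get(Customer, 1, load_customer)

print(first is second)  # both lookups resolve to the same object
print(fetches)          # the loader ran only for the first lookup
```

This only helps for lookups that go through the map. Any code that reads the relationship property directly bypasses it, which is the brittleness mentioned above.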

Are there any best practices or frameworks out there for Django to help avoid this problem? Has anyone attempted to install some kind of thread-local context into Django's ORM to avoid repeat lookups and having multiple in-memory instances mapping to the same DB model?
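To make the "thread-local context" idea concrete, here is a rough sketch of how such a scope could be managed using only the standard library. It is not an existing Django feature. identity_scope and current_cache are hypothetical names, and opening one scope per request (for example from middleware) is an assumption about how it would be used.

```python
import threading
from contextlib import contextmanager

_state = threading.local()


@contextmanager
def identity_scope():
    """Install a fresh per-thread cache and restore the old one on exit."""
    previous = getattr(_state, "cache", None)
    _state.cache = {}
    try:
        yield _state.cache
    finally:
        _state.cache = previous


def current_cache():
    """Return the active cache for this thread, or None outside a scope."""
    return getattr(_state, "cache", None)


seen_in_worker = []


def worker():
    # A different thread never sees the main thread's cache.
    seen_in_worker.append(current_cache())


with identity_scope() as cache:
    cache[("Customer", 1)] = "customer instance"  # placeholder value
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    inside = current_cache()

outside = current_cache()
```

Dropping the cache when the scope exits keeps instances from leaking between requests, at the price of starting cold each time.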

I know that query-caching stuff like JohnnyCache is out there (and helps cut down on the DB throughput); however, there is still the issue of multiple instances mapping to the same underlying model even with those measures in place.

asked Jan 18 '12 17:01

glenc

2 Answers

David Cramer's django-id-mapper is one attempt to do this.

answered Nov 02 '22 01:11

Daniel Roseman

There's a relevant DB optimization page in the Django documentation. Basically, callables are not cached but attributes are (subsequent accesses to order1.customer don't hit the database), though only in the context of the object that owns them (so the cached customer is not shared among different orders).
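The per-owner caching can be modelled in a few lines of plain Python. This is a simplified illustration, not Django's actual implementation; CachedRelation and load_customer are hypothetical names, and the Order class here is an ordinary class rather than a model.

```python
class CachedRelation:
    """Descriptor that loads a related object once per owning instance."""

    def __init__(self, name, loader):
        self.cache_name = "_%s_cache" % name
        self.id_name = "%s_id" % name
        self.loader = loader

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        if self.cache_name not in instance.__dict__:
            # Cache miss: simulate one DB fetch, remembered on this owner only.
            pk = getattr(instance, self.id_name)
            instance.__dict__[self.cache_name] = self.loader(pk)
        return instance.__dict__[self.cache_name]


fetches = []  # records every simulated DB hit


def load_customer(pk):
    fetches.append(pk)
    return {"pk": pk}  # a fresh object per fetch


class Order:
    customer = CachedRelation("customer", load_customer)

    def __init__(self, customer_id):
        self.customer_id = customer_id


order1, order2 = Order(1), Order(1)
a = order1.customer
b = order1.customer  # served from order1's own cache
c = order2.customer  # order2 has its own cache, so this fetches again
```

Here a and b are the same object, while c is an equal but distinct object produced by a second fetch, which mirrors lines (1) to (4) of the question.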

Using a cache

As you say, one way to solve your problem is to use a database cache. We use Johnny Cache (on Bitbucket), which is almost completely transparent; another good transparent one is Mozilla's Cache Machine. There are also less transparent caching systems that might actually fit the bill better; see djangopackages/caching.

Adding a cache can indeed be very beneficial if different requests need to re-use the same Customer, but please read this, which applies to most transparent cache systems, to think through whether your write/read pattern suits such a caching system.

Optimizing the requests

Another approach for your precise example is to use select_related.

order1, order2 = Order.objects.all().select_related('customer')

This way the Customer object is loaded straight away in the same SQL request, at little cost (unless it's a very big record) and with no need to experiment with other packages.
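At the SQL level, select_related amounts to a JOIN. The sketch below reproduces that shape with the standard sqlite3 module. The table names, column names and sample rows are made up for illustration and are not what Django would generate for the models in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
    );
    CREATE TABLE "order" (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id)
    );
    -- Made-up sample data: one customer with two orders.
    INSERT INTO customer VALUES (1, 'Ada', 'Lovelace');
    INSERT INTO "order" VALUES (1, 1);
    INSERT INTO "order" VALUES (2, 1);
""")

# One query returns every order together with its customer's columns.
rows = conn.execute("""
    SELECT o.id, c.id, c.first_name, c.last_name
    FROM "order" AS o
    INNER JOIN customer AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)
```

Each row carries its own copy of the customer columns, so as far as I know select_related still builds a separate Customer instance per order. It removes the extra queries at (1) and (2), but (4) would still print False.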

answered Nov 02 '22 01:11

Stefano
