Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Weird behaviour for transactions in django 1.6.1

I am using transaction.atomic as a context manager for transactions in django 1.6. There is a block of code which I want to be in a transaction which has a couple of network calls and some database writes. I am seeing very weird behaviour. Every once in while (maybe 1 in 20 times) I have noticed a partial rollback happening without any exception having been raised and the view executing without any errors. My application is hosted on heroku and we use heroku postgres v9.2.8. Pseudo code:

from django.db import transaction

def some_view(request):

    try:
        with transation.atomic():
            network_call_1()
            db_write_1.save(update_fields=['col4',])
            db_write_2.save(update_fields=['col3',])
            db_write_3.save(update_fields=['col1',])
            network_call_2()
            db_write_4.save(update_fields=['col6',])
            db_write_5.bulk_create([object1, object2])
            db_write_6.bulk_create([object1, object2])
    except Exception, e:
        logger.error(e)

    return HttpResponse()

The behaviour that I have noticed is that without any exception having been raised, either db write 1-3 have rolled back and the rest gone through or db write 1 has been rolled back and rest have gone through and so on. I don't understand why this should be happening. First, if there is a rollback, shouldn't it be a complete rollback of the transaction? If there is a rollback shouldn't an exception also be raised so that I know a rollback has happened? Everytime this has happened, no exception has been raised and the code just continues executing and returns a successful HttpResponse.

Relevant settings:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'root',
        'PASSWORD': 'root',
        'HOST': 'localhost',
        'PORT': '5432',
    },
}
CONN_MAX_AGE = None

This bug has me baffled since days. Any clues will be of great help!

like image 734
tapan Avatar asked Feb 26 '14 21:02

tapan


3 Answers

After hours of debugging, we have found the culprit.

When we start our application on gunicorn, it spawns workers. Every request coming to the same worker uses the same django DatabaseWrapper instance (postgres in our case) also referred to as a connection. If, in the middle of a transaction in one request, the worker were to receive another request, this request resets the state of the connection causing the transaction to behave in unexpected ways as documented in this bug: https://code.djangoproject.com/ticket/21239 Sometimes the transaction doesn't get committed and there is no exception raised to let you know that happened. Sometimes parts of it do get committed while the rest is lost and it looks like a partial rollback.

We thought that a connection is thread safe, but this bit of gunicorn patching magic here makes sure that's not the case: https://github.com/benoitc/gunicorn/blob/18.0/gunicorn/management/commands/run_gunicorn.py#L16

Still open to suggestions on how to sidestep this issue if possible at all.

EDIT: Don't use the run_gunicorn management command to start Django. It does some funky patching which causes DB connections to not be thread safe. The solution that worked for us is to just use "gunicorn myapp.wsgi:application -c gunicorn.conf". Django persistent DB connections don't work with the gevent worker type yet so avoid using that unless you want to run out of connections.

like image 171
tapan Avatar answered Nov 09 '22 11:11

tapan


Not a Django expert, but I do know Postgres. I agree with your assessment that this sounds like very atypical behavior for a transaction: the rollback should be all-or-nothing, and there should be an exception. That being the case, can you be absolutely certain that this is a rollback-type situation? There are lots of other possible causes that could account for different data appearing in the database than you expected, and many of those scenarios would fit better with your observed occurrences that does a rollback.

You haven't provided any specifics as to your data, but what I imagine is, you're seeing something like "I set the value of col4 to 'foo', but after the commit, the old value 'bar' is still in the database." Is that correct?

If so, then other possible causes could be:

  • The code that is supposed to setting the 'foo' value somehow, on occasion, is actually setting either the existing 'bar' value, or a NULL value.
  • The code is setting the 'foo' value, but the there is a data access layer (aka DAL) with a 'dirty' flag that is not being set (e.g. if the object is in a disconnected state), so when the commit is done, the DAL doesn't see that as being a change it is supposed to write.

These are just a few examples to get you started. There are lots of other possible scenarios. Sometimes, the basic philosophy of debugging problems like this is similar to the problem of the DDT and pelicans: since the database is at the top of the food chain, you can often see problems there that-- while they appear to be database problems-- are actually caused somewhere else in your solution.

Good luck and hope that helps!

like image 21
Robert N Avatar answered Nov 09 '22 09:11

Robert N


My 3 cents:

Exceptions

We're certain no exceptions have occurred. But are we? Your pseudo-code "handles" an exception by just logging. Make sure there are no exceptions "handled" elsewhere by logging or pass.

The partial rollback

We expect the the whole transaction to be rolled back, not just part. Since django 1.6 nested atomic transactions create a savepoint and rollbacks go back to the last savepoint. Make sure there are no nested transactions. Perhaps you have transaction middleware active check ATOMIC_REQUESTS and MIDDLEWARE_CLASSES. Maybe transactions are started in those network_call functions.

Reproducing

Since that network_call code may block. Try to replace them with mock calls, that timeout (maybe not in production). If that results in 100% (partial) rollbacks. It should make locating the problem of partial rollbacks easier.

like image 38
Chris Wesseling Avatar answered Nov 09 '22 10:11

Chris Wesseling