Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django data sharding

Tags:

django

I have successfully got my application running over several databases using the routing scheme based on models. I.e. model A lives on DB A and model B lives on DB B. I now need to shard my data. I am looking at the docs and having trouble working out how to do it as the same model needs to exist on multiple database servers. I want to have a flag to say DB for NEW members is now database X and that members X-Y live on database N etc.

How do I do that? Is it using **hints, this seems inadequately documented to me.

like image 673
Rich Avatar asked Dec 19 '10 03:12

Rich


1 Answers

The hints parameter is designed to help your database router decide where it should read or write its data. It may evolve with future versions of python, but for now there's just one kind of hint that may be given by the Django framework, and that's the instance it's working on.

I wrote this very simple database router to see what Django does:

# routers.py
import logging
logger = logging.getLogger("my_project")

class DebugRouter(object):
    """A debugging router"""

    def db_for_read(self, model, **hints):
        logger.debug("db_for_read %s" % repr((model, hints)))
        return None

    def db_for_write(self, model, **hints):
        logger.debug("db_for_write %s" % repr((model, hints)))
        return None

    def allow_relation(self, obj1, obj2, **hints):
        logger.debug("allow_relation %s" % repr((obj1, obj2, hints)))
        return None

    def allow_syncdb(self, db, model):
        logger.debug("allow_syncdb %s" % repr((db, model)))
        return None

You declare this in settings.py:

DATABASE_ROUTERS = ["my_project.routers.DebugRouter"]

Make sure logging is properly configured to output debug output (for example to stderr):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        [...some other handlers...] 
        'stderr': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler'
        }
    },
    'loggers': {
        [...some other loggers...]
        'my_project': {
            'handlers': ['stderr'],
            'level': 'DEBUG',
            'propagate': True,
        },
    }
}

Then you can open a Django shell and test a few requests to see what data your router is being given:

$ ./manage.py shell
[...]
>>> from my_project.my_app.models import User
>>> User.objects.get(pk = 1234)
db_for_read (<class 'my_project.my_app.models.User'>, {})
<User: User object>
>>> user = User.objects.create(name = "Arthur", title = "King")
db_for_write (<class 'my_project.my_app.models.User'>, {})
>>> user.name = "Kong"
>>> user.save()
db_for_write (<class 'my_project.my_app.models.User'>, {'instance':
              <User: User object>})
>>>

As you can see, the hints is always empty when no instance is available (in memory) yet. So you cannot use routers if you need query parameters (the object's id for example) in order to determine which database to query. It might be possible in the future if Django provides the query or queryset objects in the hints dict.

So to answer your question, I would say that for now you must create a custom Manager, as suggested by Aaron Merriam. But overriding just the create method is not enough, since you also need to be able to fetch an object in the appropriate database. Something like this might work (not tested yet):

class CustomManager(models.Manager)
    def self.find_database_alias(self, pk):
        return #... implement the logic to determine the shard from the pk

    def self.new_object_database_alias(self):
        return #... database alias for a new object

    def get(self, *args, **kargs):
        pk = kargs.get("pk")
        if pk is None:
            raise Exception("Sharded table: you must provide the primary key")
        db_alias = self.find_database_alias(pk)
        qs = self.get_query_set().using(db_alias)
        return qs.get(*args, **kargs)

    def create(self, *args, **kwargs):
        db_alias = self.new_object_database_alias()
        qs = super(CustomManager, self).using(db_alias)
        return qs.create(*args, **kwargs)

class ModelA(models.Model):
    objects = CustomManager()

Cheers

like image 106
MiniQuark Avatar answered Oct 04 '22 02:10

MiniQuark