 

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a database using Django's ORM. Currently each file takes ~20 minutes to process, and I was wondering if there's any way to accelerate this process.

I've taken the following measures:

  • Use @transaction.commit_manually and commit once every 5000 records
  • Set DEBUG=False so that Django won't accumulate all the SQL commands in memory
  • The loop that runs over records in a single file is completely contained in a single function (minimizes stack changes)
  • Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
  • Set force_insert=True in the save() in hopes it will save Django some logic
  • Explicitly set the id in hopes it will save Django some logic
  • General code minimization and optimization

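The batching in the first bullet can be sketched in plain Python. The `commit` callable below is a stand-in for `django.db.transaction.commit()` (the real call needs a configured Django project), and the commented-out `save()` line marks where the per-record work would go; `load_records` is an assumed name, not from the question:

```python
BATCH_SIZE = 5000  # commit once every 5000 records, as above

def load_records(records, commit):
    """Save each record, calling commit() after every BATCH_SIZE rows.

    `records` is any iterable of already-built model instances;
    `commit` stands in for django.db.transaction.commit().
    """
    count = 0
    for i, rec in enumerate(records, start=1):
        # rec.save(force_insert=True)  # skips the SELECT-then-INSERT check
        count += 1
        if i % BATCH_SIZE == 0:
            commit()  # flush a full batch
    commit()  # flush the final partial batch
    return count
```

For 12,000 records this commits three times: twice for the full batches and once for the trailing 2,000 rows.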
What else can I do to speed things up? Here are some of my thoughts:

  • Use some kind of Python compiler or version which is quicker (Psyco?)
  • Override the ORM and use SQL directly
  • Use some 3rd party code that might be better (1, 2)
  • Beg the django community to create a bulk_insert function

Any pointers regarding these items or any other idea would be welcome :)
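On the "use SQL directly" idea: a single DB-API `executemany()` call batches all rows into one round of inserts and skips the ORM's per-row overhead (SQL building, validation, signals). Here's a standalone sketch with `sqlite3` so it runs anywhere; in a Django project you would get the cursor from `django.db.connection` instead, and the `record` table and `bulk_insert` helper are illustrative names:

```python
import sqlite3

def bulk_insert(conn, rows):
    # One executemany call sends every row in a single batch,
    # instead of one ORM save() (and one INSERT statement build) per row.
    conn.executemany("INSERT INTO record (id, name) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, name TEXT)")
bulk_insert(conn, [(i, "row%d" % i) for i in range(1, 1001)])
```

The same pattern works with MySQL/PostgreSQL drivers, which may additionally collapse the batch into multi-row `INSERT ... VALUES (...), (...)` statements.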

asked Nov 27 '10 by Jonathan Livni

1 Answer

Django 1.4 provides a bulk_create() method on the QuerySet object, see:

  • https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
  • https://docs.djangoproject.com/en/dev/releases/1.4/
  • https://code.djangoproject.com/ticket/7596
answered Sep 21 '22 by Gary