 

Django adaptors CSV taking hours to import

Tags: python, django

I'm using Django adaptors to upload a simple CSV. It seems to work perfectly when I'm importing 100 or 200 contacts, but when I try to upload a 165kb file with 5000 contacts, it never completes. I let it keep trying, and when I came back after an hour it was still trying.

What's wrong with this? There is no way a 165kb file should take over an hour to import with Django adaptors. Is there something wrong with the code?

def process(self):
    self.date_start_processing = timezone.now()
    try:
        # Try to import the CSV
        ContactCSVModel.import_data(data=self.filepath, extra_fields=[
            {'value': self.group_id, 'position': 5},
            {'value': self.uploaded_by.id, 'position': 6}])
        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))

CsvModel

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"
        dbModel = Contact
        update = {'keys': ["mobile", "group"]}
asked Apr 14 '13 by Prometheus

3 Answers

Split up your larger task into smaller pieces.

Step 1 - Just read a CSV file

Both ContactCSVModel.import_from_filename() and ContactCSVModel.import_from_file() return the CSV lines. Disable the interaction with your Django model so the task skips the database entirely. This should speed up the task considerably; print the imported data to check what was read. This step should definitely work!

CsvModel

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"

Your code

def process(self):
    self.date_start_processing = timezone.now()
    try:
        # Try to import the CSV
        lines = ContactCSVModel.import_data(data=self.filepath, extra_fields=[
            {'value': self.group_id, 'position': 5},
            {'value': self.uploaded_by.id, 'position': 6}])
        print lines  # or use logging

        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))

Step 2 - Enable Django model interaction BUT disable the check for existing items in the DB

Disable it because, when enabled, this feature queries the DB for every line in the CSV to check for an existing item according to your natural key specification (I have read the source code). You probably know that all the lines in your CSV are unique contacts anyway.

This helps if your problem is slow DB queries during the import, but it does not really help if the import consumes too much memory.

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"
        dbModel = Contact

Step 3 - Import equally sized chunks of CSV

Use the CsvModel and enable interaction with the Contact model, but provide smaller iterables to ContactCSVModel.import_data(). I set the chunk size to 500; change it to your needs. The code sample below is to give you the idea - you will need to change it a bit to fit it into your existing code. This will help if memory consumption is the problem.

import csv

# Match the "^" delimiter declared in ContactCSVModel.Meta
reader = csv.reader(open(self.filepath, 'rb'), delimiter='^')

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # start a fresh list; clearing in place would mutate the chunk just yielded
        chunk.append(line)
    if chunk:  # don't yield an empty trailing chunk
        yield chunk

for chunk in gen_chunks(reader, chunksize=500):
    ContactCSVModel.import_data(data=chunk, extra_fields=[
        {'value': self.group_id, 'position': 5},
        {'value': self.uploaded_by.id, 'position': 6}])

Step 4 - Target large memory consumption and slow operation

Because django-adaptors holds all Contact model instances in memory during the import, and because it is slow due to many single commits instead of one bulk insert operation, it is not well suited for larger files.

You are somewhat tied to django-adaptors: you can't switch to bulk inserts as long as you rely on this package. Check the memory consumption under Linux with top or htop, and on Windows with Task Manager. If the process eats too much memory and the OS starts swapping, switch to another Django add-on with more efficient memory usage and bulk inserts as an option - there are plenty of them for CSV imports.

Another hint is to use the csv module for reading and your Django model knowledge for interacting with the database. This is not really a challenge for you - just try it with isolated tasks of your big picture and put them together once they work - good luck.
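As a rough illustration - a minimal sketch, assuming your Contact model has exactly the fields shown in the question and that plain inserts are acceptable (bulk_create does not do the update-by-mobile/group handling that django-adaptors gave you; the import path is an assumption too):

import csv

from yourapp.models import Contact  # assumption: adjust to your app's real import path

def import_contacts(filepath, group, owner, batch_size=500):
    """Read the '^'-delimited file and insert contacts in batches."""
    batch = []
    with open(filepath, 'rb') as f:  # 'rb' for the Python 2 csv module
        for row in csv.reader(f, delimiter='^'):
            first_name, last_name, company, mobile = row[:4]
            batch.append(Contact(
                first_name=first_name, last_name=last_name,
                company=company, mobile=mobile,
                group=group, contact_owner=owner))
            if len(batch) >= batch_size:
                Contact.objects.bulk_create(batch)  # one INSERT per batch
                batch = []
    if batch:
        Contact.objects.bulk_create(batch)  # insert the remainder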

answered Nov 20 '22 by Sascha Gottfried


I would first check that there are no data errors in the CSV, e.g. a column with erroneous escape characters or incorrect data types - perhaps the DB cannot accept null values in some columns.
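A quick, hedged pre-scan with the stdlib csv module can surface such rows (the expected column count and file name here are assumptions - adjust them to your file):

import csv

EXPECTED_COLUMNS = 5  # assumption: set this to the number of columns in your file

with open('contacts.csv', 'rb') as f:
    for lineno, row in enumerate(csv.reader(f, delimiter='^'), 1):
        if len(row) != EXPECTED_COLUMNS:
            print lineno, row  # malformed row that could break the import
        elif any(not field.strip() for field in row):
            print lineno, 'empty field:', row  # possible NULL for a NOT NULL column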

Whilst it is hanging, can you manually check to see if the DB is populating, either through the command-line MySQL prompt or Workbench? If it is, then auto-commit is turned on and you should be able to see which row it is hanging on - then check that record in the CSV.
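For example, while the import runs you can watch the row count from a second terminal in a Django shell (python manage.py shell) - a small sketch, with the model import path assumed:

import time
from yourapp.models import Contact  # assumption: adjust to your app's real import path

for _ in range(20):
    print Contact.objects.count()  # a growing count means rows are being committed
    time.sleep(5)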

However, if auto-commit is turned off (I don't know what Django does by default, or how your DB is configured), then it is possible you are overflowing the transaction buffer. There should be a way to manually flush/commit the transaction in stages to get around this.
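A hedged sketch of committing in stages with transaction.atomic (Django 1.6+; older versions have transaction.commit_on_success instead), with an illustrative chunk size:

from django.db import transaction

CHUNK = 500  # illustrative; tune to your setup

def save_in_stages(contacts):
    # Give each chunk its own transaction so no single huge transaction builds up.
    for start in range(0, len(contacts), CHUNK):
        with transaction.atomic():
            for contact in contacts[start:start + CHUNK]:
                contact.save()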

answered Nov 20 '22 by Simon Todd


The first thing to try is to pass an iterable to the import_data function:

ContactCSVModel.import_data(open(self.filepath), extra_fields=[
    {'value': self.group_id, 'position': 5},
    {'value': self.uploaded_by.id, 'position': 6}])

The second thing to try is to use import_from_filename:

ContactCSVModel.import_from_filename(self.filepath, extra_fields=[
    {'value': self.group_id, 'position': 5},
    {'value': self.uploaded_by.id, 'position': 6}])

If this doesn't help, try to figure out where it is hanging. You can do it manually by reducing the size of your CSV file, or you can put a mock on csv.reader, or you can mock CsvImporter.process_line and, instead of processing lines, print them out to see where it stops. Let me know if you need help with mocking.
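For instance, a rough monkey-patch along these lines (the adaptor.model import path is an assumption - check where CsvImporter lives in your installed version):

from adaptor.model import CsvImporter  # assumption: path may differ in your version

_original_process_line = CsvImporter.process_line

def traced_process_line(self, line):
    print line  # the last line printed shows where the import stalls
    return _original_process_line(self, line)

CsvImporter.process_line = traced_process_line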

Also, this issue can be related.

answered Nov 20 '22 by alecxe