
Best practice of bulk_create for massive records

I use bulk_create to insert 1 million records into a new table. It takes 80 seconds. Django uses only one CPU core (roughly 25% CPU; no core reaches 100%), so I believe there is room for improvement.

Here is the code:

from django.db import models

class Stock(models.Model):
    code = models.CharField(max_length=6, unique=True)
    name = models.CharField(max_length=8)

class Watcher(models.Model):
    date = models.DateField(db_index=True)
    stock = models.ForeignKey(Stock, unique_for_date="date")
    num = models.PositiveIntegerField(default=0)

# Build all model instances in memory, then insert them in one call.
batch = []
for row in inputs:
    watcher = Watcher(date=row['date'], stock=get_stock(), num=row['num'])
    batch.append(watcher)
Watcher.objects.bulk_create(batch)

I've tried several things:

  1. Use a proper batch_size. The best value for me is roughly 4000; it still takes 80-90 seconds (see the sketch after this list).
  2. Use a ThreadPool(). It is much slower: roughly 120-140 seconds.
  3. Remove the index on the DateField. A bit slower than 1.
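
For what it's worth, attempt 1 is just a matter of passing batch_size to the same call; bulk_create has accepted this argument since Django 1.5, and 4000 is simply the best value from my trials:

# Split the insert into chunks of 4000 rows each instead of
# sending one enormous multi-row INSERT statement.
Watcher.objects.bulk_create(batch, batch_size=4000)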

I'm using MySQL 5.6 Community Edition. The storage engine is MyISAM. Here is the config:

[mysqld]
server-id   = 1
default-storage-engine = MyISAM
port        = 3306
socket      = /tmp/mysql.sock
datadir     = "{MYSQLDATAPATH}"
skip-external-locking
explicit_defaults_for_timestamp = TRUE

# MyISAM #
key-buffer-size = 32M
max_allowed_packet = 16M

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
query-cache-type               = 0
query-cache-size               = 0
max-connections                = 500
thread-cache-size              = 50
open-files-limit               = 65535
table-definition-cache         = 1024
table-open-cache               = 2048

# LOGGING
log-bin       = mysql-bin
binlog_format = mixed

When I import another table (similar structure, same indexes, but with 9 columns), it takes 15 minutes. The time increase is not linear.

Is there anything wrong with bulk_create?


Update 1

Although I've accepted the answer, I think I should understand the mystery. So I did some more tests and found that Django's model instantiation is the root cause of the slowdown. When I have 800,000 records, calling the model constructor 800,000 times is very time consuming.

The ORM framework does a lot of internal work that we do not see, for instance the integrity checks. In my case the records are imported into an empty database table, so those checks are unnecessary.

Now I use cursor.executemany(), which shortens the insertion time of 800,000 4-column records from 54 seconds to 16 seconds, and the insertion time of 800,000 13-column records from 13 minutes to 46 seconds.

According to my experiments, you should call executemany() once for every 3000-5000 records. I tried passing all 800k records in a single call, and it was extremely slow.
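
A minimal sketch of this approach (the table name myapp_watcher and the tuple shape are assumptions based on the models above, not my exact code):

from django.db import connection

CHUNK = 4000  # a few thousand rows per executemany() call worked best for me

sql = "INSERT INTO myapp_watcher (date, stock_id, num) VALUES (%s, %s, %s)"
rows = [(r['date'], r['stock_id'], r['num']) for r in inputs]  # plain tuples, no model instances

with connection.cursor() as cursor:
    for i in range(0, len(rows), CHUNK):
        cursor.executemany(sql, rows[i:i + CHUNK])

This skips model instantiation and validation entirely, which is exactly where the time was going.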

asked Sep 27 '15 by stanleyxu2005


1 Answer

While bulk_create is useful for saving a small number of records, such as when processing an HTML form, it's not ideally suited for saving thousands of records. As you have already found out, it's slow because it needs a lot of memory and sends a very large query to the database. Fortunately, LOAD DATA INFILE comes to the rescue.

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed. LOAD DATA INFILE is the complement of SELECT ... INTO OUTFILE.

We can produce such a file using Python's csv writer; the following example is adapted from the documentation:

import csv

# 'wb' is the Python 2 idiom from the docs; on Python 3 use open('some.csv', 'w', newline='')
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)

Finally, as you have already found out, the LOCAL option to LOAD DATA INFILE can sometimes be used for convenience:

LOCAL works only if your server and your client both have been configured to permit it

When using this option, the file does not need to be manually transferred to the server. You can generate the CSV file on the client side and the local option will cause the mysql client to automatically transfer the file to the server.

If LOCAL is not specified, the file must be located on the server host and is read directly by the server.
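
Putting it together, a minimal end-to-end sketch (the table name myapp_watcher and the file path are assumptions; with MySQLdb you also need local_infile=1 in the connection OPTIONS for LOCAL to work):

import csv
from django.db import connection

# Write the rows as CSV on the client side...
with open('/tmp/watchers.csv', 'w') as f:
    csv.writer(f).writerows(rows)  # rows: iterable of (date, stock_id, num) tuples

# ...then have the server bulk-load the file in a single statement.
with connection.cursor() as cursor:
    cursor.execute(
        "LOAD DATA LOCAL INFILE '/tmp/watchers.csv' "
        "INTO TABLE myapp_watcher "
        "FIELDS TERMINATED BY ',' "
        "(date, stock_id, num)"
    )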

answered Oct 24 '22 by e4c5