I use bulk_create to insert 1 million records into a new table. It takes 80 seconds. Django only uses one CPU core (roughly 25% CPU overall, but no core reaches 100%), so I believe there is room for improvement.
Here is the code:
class Stock(models.Model):
    code = models.CharField(max_length=6, unique=True)
    name = models.CharField(max_length=8)

class Watcher(models.Model):
    date = models.DateField(db_index=True)
    stock = models.ForeignKey(Stock, unique_for_date="date")
    num = models.PositiveIntegerField(default=0)
batch = []
for input in inputs:
    watcher = Watcher(date=input['date'], stock=get_stock(), num=input['num'])
    batch.append(watcher)
Watcher.objects.bulk_create(batch)
I've tried several things:

1. batch_size. The best value for me is roughly 4000. It takes 80-90 seconds (see the sketch after this list).
2. ThreadPool(). It is much slower, roughly 120-140 seconds.
3. DateField. A bit slower than 1.
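For reference, batch_size is passed straight to bulk_create; a minimal sketch with the value that worked best for me:

# Same batch as above; Django splits the insert into queries of 4000 rows each.
Watcher.objects.bulk_create(batch, batch_size=4000)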
I'm using MySQL 5.6 Community Edition, and the storage engine is MyISAM. Here is the config:
[mysqld]
server-id = 1
default-storage-engine = MyISAM
port = 3306
socket = /tmp/mysql.sock
datadir = "{MYSQLDATAPATH}"
skip-external-locking
explicit_defaults_for_timestamp = TRUE
# MyISAM #
key-buffer-size = 32M
max_allowed_packet = 16M
# CACHES AND LIMITS #
tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 2048
# LOGGING #
log-bin = mysql-bin
binlog_format = mixed
When I import another table (similar structure, same indices, but with 9 columns), it takes 15 minutes. The increase in time is not linear.

Is anything wrong with bulk_create?
Although I've accepted the answer, I think I should understand the mystery. So I did some more tests and found that Django's model instantiation is the root cause of the slowdown. When there are 800,000 records, creating a model instance 800,000 times is very time consuming.

The ORM framework does a lot of internal work that we do not see, for instance integrity checks. In my case, massive numbers of records are imported into an empty database table, so those checks are not necessary.
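A quick way to see this overhead is to time model instantiation against building plain tuples for the same data (a sketch of how the overhead could be measured, not my original test; Stock and Watcher are the models above):

from datetime import date
import timeit

# Run in a Django shell with the models imported.
d = date.today()
s = Stock.objects.first()  # any existing Stock row

# Building full model instances vs. plain tuples for the same 1000 rows.
models_time = timeit.timeit(
    lambda: [Watcher(date=d, stock=s, num=0) for _ in range(1000)], number=100)
tuples_time = timeit.timeit(
    lambda: [(d, s.pk, 0) for _ in range(1000)], number=100)
print(models_time, tuples_time)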
Now I use cursor.executemany(), which shortens the insertion time for 800,000 4-column records from 54 seconds to 16 seconds, and for 800,000 13-column records from 13 minutes to 46 seconds.

According to my experiments, you should call executemany for every 3,000-5,000 records. I once tried 800k records in a single call, and it was extremely slow.
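Concretely, this is roughly the pattern (a sketch: the table name myapp_watcher is Django's default for a Watcher model in an app called myapp and is an assumption, while inputs and get_stock() are as in the question):

from django.db import connection

# Plain tuples instead of model instances.
rows = [(item['date'], get_stock().pk, item['num']) for item in inputs]

sql = "INSERT INTO myapp_watcher (date, stock_id, num) VALUES (%s, %s, %s)"

with connection.cursor() as cursor:
    # 3000-5000 rows per executemany call worked best for me.
    for start in range(0, len(rows), 4000):
        cursor.executemany(sql, rows[start:start + 4000])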
While bulk_create is useful for saving a small number of records while processing an HTML form, it's not ideally suited for saving thousands of records. As you have already found out, it's slow because it needs a lot of memory and sends a very large query to the database. Fortunately, LOAD DATA INFILE comes to the rescue.
The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed. LOAD DATA INFILE is the complement of SELECT ... INTO OUTFILE.
We can produce a file similar to what's produced by SELECT ... INTO OUTFILE by using the csv writer; the following example is from the documentation:
import csv
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
Finally, as you have already found out, the LOCAL option to LOAD DATA INFILE can sometimes be used for convenience.
LOCAL works only if your server and your client both have been configured to permit it
When using this option, the file does not need to be manually transferred to the server. You can generate the CSV file on the client side, and the LOCAL option will cause the MySQL client to automatically transfer the file to the server.
If LOCAL is not specified, the file must be located on the server host and is read directly by the server.
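Putting it together, here is a minimal sketch of generating the CSV on the client and loading it from Django. The table name myapp_watcher and the column order are assumptions based on the Watcher model above, inputs and get_stock() are as in the question, and LOCAL must be permitted on both client and server as the quoted documentation says.

import csv
from django.db import connection

# Write the rows as CSV on the client side.
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(
        (item['date'], get_stock().pk, item['num']) for item in inputs
    )

# Bulk-load the file; the client transfers it to the server because of LOCAL.
with connection.cursor() as cursor:
    cursor.execute(
        "LOAD DATA LOCAL INFILE 'some.csv' "
        "INTO TABLE myapp_watcher "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        "(date, stock_id, num)"
    )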