 

How can I improve my INSERT statement performance?

Tags:

python

mysql

Josh's answer here gave me a good head start on how to insert a 256x64x250 value array into a MySQL database. However, when I actually tried his INSERT statement on my data, it turned out to be horribly slow (as in 6 minutes for a 16 MB file).

ny, nx, nz = np.shape(data)
query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s)"""
# one execute() call per value: roughly 4 million round trips to the server
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            cursor.execute(query, (frames, rows, cols, data[rows, cols, frames]))

I was reading MySQL for Python, which explained that this wasn't the right approach because executing 4 million separate inserts is very inefficient.

Now, my data consists largely of zeros (more than 90%, actually), so I threw in an if statement so that I only insert values greater than zero, and I used executemany() instead:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            if data[rows,cols,frames] > 0.0:
                values.append((frames, rows, cols, data[rows,cols,frames]))           
cur.executemany(query, values)

This miraculously brought my processing time down to about 20 seconds: 14 seconds spent creating the list of values (37k rows) and 4 seconds on the actual insert into the database.

So now I'm wondering: how can I speed this process up any further? I have a feeling my loop is horribly inefficient and there has to be a better way. If I need to insert 30 measurements per dog, this would still take 10 minutes, which seems far too long for this amount of data.

Here are two versions of my raw files: with headers or without headers. I'd love to try LOAD DATA INFILE, but I can't figure out how to parse the data correctly.

Asked Mar 27 '11 by Ivo Flipse


2 Answers

The fastest way to insert 4 million rows (16 MB of data) would be to use LOAD DATA INFILE - http://dev.mysql.com/doc/refman/5.0/en/load-data.html

So, if possible, generate a CSV file and then use LOAD DATA INFILE.
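
As a side note (this is not the answerer's actual conversion program): if the measurement is already in memory as the numpy array from the question, a minimal sketch like the one below could write the non-zero values straight to a CSV ready for LOAD DATA INFILE. The function name dump_nonzero_csv and the frames.csv path are just illustrative.

import csv
import numpy as np

def dump_nonzero_csv(data, path="frames.csv"):
    # indices of all non-zero cells, axis order matching data[row, col, frame]
    rows, cols, frames = np.nonzero(data)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "sensor_row", "sensor_col", "value"])
        for r, c, fr in zip(rows, cols, frames):
            # cast numpy scalars to plain Python types for clean CSV output
            writer.writerow([int(fr), int(r), int(c), float(data[r, c, fr])])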

hope this helps :)

EDIT

So I took one of your original data files, rolloff.dat, and wrote a quick and dirty program to convert it to the following CSV format.

Download frames.dat from here: http://rapidshare.com/files/454896698/frames.dat

Frames.dat

patient_name, sample_date dd/mm/yyyy, frame_time (ms), frame 0..248, row 0..255, col 0..62, value
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,40,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,42,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,43,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,4,40,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,40,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,44,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,45,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,10,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,11,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,12,1.1
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,13,1.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,14,0.4

The file contains data only for the frame/row/col combinations that have a non-zero value, so zeros are excluded. 24799 data rows were generated from your original file.

Next, I created a temporary loading (staging) table into which the frames.dat file is loaded. This is a temporary table which will allow you to manipulate/transform the data before loading into the proper production/reporting tables.

drop table if exists sample_temp;
create table sample_temp
(
patient_name varchar(255) not null,
sample_date date,
frame_time decimal(6,2) not null default 0,
frame_id tinyint unsigned not null,
row_id tinyint unsigned not null,
col_id tinyint unsigned not null,
value decimal(4,1) not null default 0,
primary key (frame_id, row_id, col_id)
)
engine=innodb;

All that remains is to load the data (note: I am using Windows, so you'll have to edit this script to make it Linux compatible - check the pathnames and change '\r\n' to '\n').

truncate table sample_temp;

start transaction;

load data infile 'c:\\import\\frames.dat' 
into table sample_temp
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\r\n'
ignore 1 lines
(
patient_name,
@sample_date,
frame_time,
frame_id,
row_id,
col_id,
value
)
set 
sample_date = str_to_date(@sample_date,'%d/%m/%Y');

commit;

Query OK, 24799 rows affected (1.87 sec)
Records: 24799  Deleted: 0  Skipped: 0  Warnings: 0

The 24K rows were loaded in 1.87 seconds.
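
If you'd rather drive the load from Python instead of the mysql client, a sketch along these lines should work with MySQLdb. The connection parameters are placeholders, and it assumes LOCAL INFILE is enabled on both client and server (local_infile=1), which is an assumption on my part rather than something covered above.

import MySQLdb

# placeholder credentials; local_infile=1 assumes the server permits LOCAL INFILE
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()
# no parameters are passed to execute(), so the % signs in the date
# format string are sent to the server as-is (no %% escaping needed)
cur.execute("""
    LOAD DATA LOCAL INFILE 'frames.dat'
    INTO TABLE sample_temp
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (patient_name, @sample_date, frame_time, frame_id, row_id, col_id, value)
    SET sample_date = str_to_date(@sample_date, '%d/%m/%Y')
""")
conn.commit()
cur.close()
conn.close()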

Hope this helps :)

Answered by Jon Black

If the data is a numpy array, you can try this:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
rows, cols, frames = numpy.nonzero(data)
for row, col, frame in zip(rows, cols, frames):
    values.append((frame, row, col, data[row,col,frame]))

cur.executemany(query, values)

or

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
rows, cols, frames = numpy.nonzero(data)
values = [(row, col, frame, val) for row, col, frame, val in zip(rows, cols, frames, data[rows,cols,frames])]
cur.executemany(query, values)
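
One caveat (an assumption on my part, since it depends on which MySQL driver and version you use): numpy scalar types such as numpy.float64 are not always accepted as query parameters, so it can be safer to cast to plain Python int/float before calling executemany():

rows, cols, frames = numpy.nonzero(data)
# cast numpy scalars to native Python types before binding them as parameters
values = [(int(frame), int(row), int(col), float(val))
          for row, col, frame, val in zip(rows, cols, frames, data[rows, cols, frames])]
cur.executemany(query, values)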

Hope it helps

Answered by Hernan