 

How can I improve my INSERT statement performance?

Tags:

python

mysql

Josh's answer here gave me a good head start on how to insert a 256x64x250 value array into a MySQL database. However, when I actually tried his INSERT statement on my data, it turned out to be horribly slow (as in 6 minutes for a 16 MB file).

ny, nx, nz = np.shape(data)
query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s)"""
# one execute() call per value: roughly 4 million round trips to the server
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            cursor.execute(query, (frames, rows, cols, data[rows, cols, frames]))

I was reading MySQL for Python, which explained that this wasn't the right approach because executing 4 million separate inserts is very inefficient.

Now, my data consists largely of zeros (more than 90%, actually), so I threw in an if statement so that I only insert values greater than zero, and I used executemany() instead:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            if data[rows,cols,frames] > 0.0:
                values.append((frames, rows, cols, data[rows,cols,frames]))           
cur.executemany(query, values)

This miraculously brought my processing time down to about 20 seconds: 14 seconds spent creating the list of values (37k rows) and 4 seconds on the actual insert into the database.

So now I'm wondering: how can I speed this process up any further? I have a feeling my loop is horribly inefficient and there has to be a better way. If I need to insert 30 measurements per dog, this would still take 10 minutes, which seems far too long for this amount of data.

Here are two versions of my raw files: with headers or without headers. I'd love to try LOAD DATA INFILE, but I can't figure out how to parse the data correctly.

Asked Mar 27 '11 by Ivo Flipse


2 Answers

The fastest way to insert 4 million rows (16 MB of data) would be to use LOAD DATA INFILE - http://dev.mysql.com/doc/refman/5.0/en/load-data.html

So, if possible, generate a CSV file and then use LOAD DATA INFILE.
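
As a side note (this is not the answerer's actual conversion program): if the measurement is already in memory as the numpy array from the question, a minimal sketch like the one below could write the non-zero values straight to a CSV ready for LOAD DATA INFILE. The function name dump_nonzero_csv and the frames.csv path are just illustrative.

import csv
import numpy as np

def dump_nonzero_csv(data, path="frames.csv"):
    # indices of all non-zero cells, axis order matching data[row, col, frame]
    rows, cols, frames = np.nonzero(data)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "sensor_row", "sensor_col", "value"])
        for r, c, fr in zip(rows, cols, frames):
            # cast numpy scalars to plain Python types for clean CSV output
            writer.writerow([int(fr), int(r), int(c), float(data[r, c, fr])])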

hope this helps :)

EDIT

So I took one of your original data files, rolloff.dat, and wrote a quick and dirty program to convert it to the following CSV format.

Download frames.dat from here: http://rapidshare.com/files/454896698/frames.dat

Frames.dat

patient_name, sample_date dd/mm/yyyy, frame_time (ms), frame 0..248, row 0..255, col 0..62, value
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,40,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,42,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,43,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,4,40,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,40,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,44,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,45,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,10,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,11,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,12,1.1
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,13,1.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,14,0.4

The file contains data only for the frame/row/col combinations that have a non-zero value, so zeros are excluded. 24799 data rows were generated from your original file.

Next, I created a temporary loading (staging) table into which the frames.dat file is loaded. This is a temporary table which will allow you to manipulate/transform the data before loading into the proper production/reporting tables.

drop table if exists sample_temp;
create table sample_temp
(
patient_name varchar(255) not null,
sample_date date,
frame_time decimal(6,2) not null default 0,
frame_id tinyint unsigned not null,
row_id tinyint unsigned not null,
col_id tinyint unsigned not null,
value decimal(4,1) not null default 0,
primary key (frame_id, row_id, col_id)
)
engine=innodb;

All that remains is to load the data (note: I am using Windows, so you'll have to edit this script to make it Linux compatible - check the pathnames and change '\r\n' to '\n').

truncate table sample_temp;

start transaction;

load data infile 'c:\\import\\frames.dat' 
into table sample_temp
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\r\n'
ignore 1 lines
(
patient_name,
@sample_date,
frame_time,
frame_id,
row_id,
col_id,
value
)
set 
sample_date = str_to_date(@sample_date,'%d/%m/%Y');

commit;

Query OK, 24799 rows affected (1.87 sec)
Records: 24799  Deleted: 0  Skipped: 0  Warnings: 0

The 24K rows were loaded in 1.87 seconds.
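
If you'd rather drive the load from Python instead of the mysql client, a sketch along these lines should work with MySQLdb. The connection parameters are placeholders, and it assumes LOCAL INFILE is enabled on both client and server (local_infile=1), which is an assumption on my part rather than something covered above.

import MySQLdb

# placeholder credentials; local_infile=1 assumes the server permits LOCAL INFILE
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()
# no parameters are passed to execute(), so the % signs in the date
# format string are sent to the server as-is (no %% escaping needed)
cur.execute("""
    LOAD DATA LOCAL INFILE 'frames.dat'
    INTO TABLE sample_temp
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (patient_name, @sample_date, frame_time, frame_id, row_id, col_id, value)
    SET sample_date = str_to_date(@sample_date, '%d/%m/%Y')
""")
conn.commit()
cur.close()
conn.close()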

Hope this helps :)

Answered by Jon Black

If the data is a numpy array, you can try this:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
rows, cols, frames = numpy.nonzero(data)
for row, col, frame in zip(rows, cols, frames):
    values.append((frame, row, col, data[row,col,frame]))

cur.executemany(query, values)

or

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
rows, cols, frames = numpy.nonzero(data)
values = [(row, col, frame, val) for row, col, frame, val in zip(rows, cols, frames, data[rows,cols,frames])]
cur.executemany(query, values)
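
One caveat (an assumption on my part, since it depends on which MySQL driver and version you use): numpy scalar types such as numpy.float64 are not always accepted as query parameters, so it can be safer to cast to plain Python int/float before calling executemany():

rows, cols, frames = numpy.nonzero(data)
# cast numpy scalars to native Python types before binding them as parameters
values = [(int(frame), int(row), int(col), float(val))
          for row, col, frame, val in zip(rows, cols, frames, data[rows, cols, frames])]
cur.executemany(query, values)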

Hope it helps

Answered by Hernan