Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is the same SQLite query being 30 times slower when fetching only twice as many results?

I have been working on speeding up a query I'm using for about a week now and asked several questions about it here ( How can I speed up fetching the results after running an sqlite query?, Is it normal that sqlite.fetchall() is so slow?, How to use min() and max() in an efficient way?).

With the very useful help from the answers given there I managed to get the time down to the sqlite query taking 100.95 seconds and fetchall taking: 1485.43. This was still not enough, so after trying out some different indexes I managed to get the query time down to 0.08 seconds for one sample and the fetchall time down to 54.97 seconds. So I thought I finally managed to speed things up enough.

Then the query runs for the next sample, taking 0.58 seconds, and the fetchall taking 3952.80 seconds. For the third sample the query took 1.01 seconds and took 1970.67 seconds to fetchall.

The first sample fetched 12951 rows, the second sample 24972 rows and the third 6470 rows. I'm very curious why the first sample was so much faster to fetch the rows, when it had only about half the amount to fetch as the second example.


Code (spectrumFeature_inputValues is (1,), (2,) and (3,), from the 3 samples used.):

self.cursor.execute('begin') self.cursor.execute("EXPLAIN QUERY PLAN "+                     "SELECT precursor_id, feature_table_id "+                     "FROM `MSMS_precursor` "+                     "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+                     "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+                     "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+                     "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+                     "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues) print 'EXPLAIN QUERY PLAN: ' print self.cursor.fetchall() import time time0 = time.time() self.cursor.execute("SELECT precursor_id, feature_table_id "+                     "FROM `MSMS_precursor` "+                     "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+                     "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+                     "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+                     "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+                     "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues) print 'query took:',time.time()-time0,'seconds' time0 = time.time() precursorFeatureIds = self.cursor.fetchall() print 'it fetched:',len(precursorFeatureIds),'rows' print 'fetchall took',time.time()-time0,'seconds' time0 = time.time() for precursorAndFeatureID in precursorFeatureIds:     feature_has_MSMS_precursor_inputValues = (precursorAndFeatureID[0], precursorAndFeatureID[1])     self.cursor.execute("INSERT INTO `feature_has_MSMS_precursor` VALUES(?,?)", feature_has_MSMS_precursor_inputValues) print 'inserting took',time.time()-time0,'seconds' self.connection.commit() 

and the results:

EXPLAIN QUERY PLAN:  [(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.0754859447479 seconds it fetched: 12951 rows fetchall took 54.2855291367 seconds inserting took 0.602859973907 seconds It took 54.9704811573 seconds  EXPLAIN QUERY PLAN:  [(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.579694032669 seconds it fetched: 24972 rows fetchall took 3950.08093309 seconds inserting took 2.11575508118 seconds  It took 3952.80745602 seconds  EXPLAIN QUERY PLAN:  [(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 1.01185703278 seconds it fetched: 6470 rows fetchall took 1970.622962 seconds inserting took 0.673867940903 seconds It took 1972.31343699 seconds 

SQLite create statements:

-- ----------------------------------------------------- -- Table `feature` -- ----------------------------------------------------- CREATE  TABLE IF NOT EXISTS `feature` (   `feature_table_id` INT PRIMARY KEY NOT NULL ,   `feature_id` VARCHAR(40) NOT NULL ,   `intensity` DOUBLE NOT NULL ,   `overallquality` DOUBLE NOT NULL ,   `charge` INT NOT NULL ,   `content` VARCHAR(45) NOT NULL ,   `intensity_cutoff` DOUBLE NOT NULL,   `mzMin` DOUBLE NULL ,   `mzMax` DOUBLE NULL ,   `rtMin` DOUBLE NULL ,   `rtMax` DOUBLE NULL ,   `msrun_msrun_id` INT NOT NULL ,   CONSTRAINT `fk_feature_msrun1`     FOREIGN KEY (`msrun_msrun_id` )     REFERENCES `msrun` (`msrun_id` )     ON DELETE NO ACTION     ON UPDATE NO ACTION);    CREATE INDEX `fk_mzMin_feature` ON `feature` (`mzMin` ASC);    CREATE INDEX `fk_mzMax_feature` ON `feature` (`mzMax` ASC);    CREATE INDEX `fk_rtMin_feature` ON `feature` (`rtMin` ASC);    CREATE INDEX `fk_rtMax_feature` ON `feature` (`rtMax` ASC);  DROP TABLE IF EXISTS `spectrum`; -- ----------------------------------------------------- -- Table `spectrum` -- ----------------------------------------------------- CREATE  TABLE IF NOT EXISTS `spectrum` (   `spectrum_id` INT PRIMARY KEY NOT NULL ,   `spectrum_index` INT NOT NULL ,   `ms_level` INT NOT NULL ,   `base_peak_mz` DOUBLE NOT NULL ,   `base_peak_intensity` DOUBLE NOT NULL ,   `total_ion_current` DOUBLE NOT NULL ,   `lowest_observes_mz` DOUBLE NOT NULL ,   `highest_observed_mz` DOUBLE NOT NULL ,   `scan_start_time` DOUBLE NOT NULL ,   `ion_injection_time` DOUBLE,   `binary_data_mz` BLOB NOT NULL,   `binary_data_rt` BLOB NOT NULL,   `msrun_msrun_id` INT NOT NULL ,   CONSTRAINT `fk_spectrum_msrun1`     FOREIGN KEY (`msrun_msrun_id` )     REFERENCES `msrun` (`msrun_id` )     ON DELETE NO ACTION     ON UPDATE NO ACTION);  CREATE INDEX `fk_spectrum_spectrum_id_1` ON  `spectrum` (`spectrum_id` ASC); CREATE INDEX `fk_spectrum_scahn_start_time_1` ON  `spectrum` (`scan_start_time` ASC);  DROP TABLE IF EXISTS `feature_has_MSMS_precursor`; -- ----------------------------------------------------- -- Table `spectrum_has_feature` -- ----------------------------------------------------- CREATE  TABLE IF NOT EXISTS `feature_has_MSMS_precursor` (   `MSMS_precursor_precursor_id` INT NOT NULL ,   `feature_feature_table_id` INT NOT NULL ,   CONSTRAINT `fk_spectrum_has_feature_spectrum1`     FOREIGN KEY (`MSMS_precursor_precursor_id` )     REFERENCES `MSMS_precursor` (`precursor_id` )     ON DELETE NO ACTION     ON UPDATE NO ACTION,   CONSTRAINT `fk_spectrum_has_feature_feature1`     FOREIGN KEY (`feature_feature_table_id` )     REFERENCES `feature` (`feature_table_id` )     ON DELETE NO ACTION     ON UPDATE NO ACTION);    CREATE INDEX `fk_feature_has_MSMS_precursor_feature1` ON `feature_has_MSMS_precursor` (`feature_feature_table_id` ASC);   CREATE INDEX `fk_feature_has_MSMS_precursor_precursor1` ON `feature_has_MSMS_precursor` (`MSMS_precursor_precursor_id` ASC); 

As you can see I have made indexes out of the mz and rt values in both spectrum and feature, because I figured that most time is spent comparing those numbers together.

So why is the first sample so much faster than the second and third? And how does the query time relate to the fetchall time? Most importantly, is there a way I can speed this up?


Update 1:

After talking to a collegaue it's probably because comparing a point to a 2d dimension (the rtMin, rtMax, mzMin, mzMax) will take n^2 time. This roughly corresponds to the second fetchall taking a bit more than 60^2 seconds (aproximate time the first fetchall took) and it retrieved a little less than twice the amount of rows. This doesn't answer any of my questions though.


Update 2:

I tried using R*tree as advised in the comments. I made a new table:

CREATE VIRTUAL TABLE convexhull_edges USING rtree(    feature_feature_table_id,                 rtMin, rtMax,          mzMin, mzMax,        );  

and change my query to:

self.cursor.execute("SELECT precursor_id, feature_table_id "+                     "FROM `MSMS_precursor` "+                     "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+                     "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+                     "INNER JOIN `convexhull_edges` ON convexhull_edges.feature_feature_table_id = feature.feature_table_id "                     "WHERE spectrum.scan_start_time BETWEEN convexhull_edges.rtMin AND convexhull_edges.rtMax "+                     "AND MSMS_precursor.ion_mz BETWEEN convexhull_edges.mzMin AND convexhull_edges.mzMax "+                     "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues) 

This gave the following results:

EXPLAIN QUERY PLAN:  [(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.0572800636292 seconds it fetched: 13140 rows fetchall took 34.4445540905 seconds  EXPLAIN QUERY PLAN:  [(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.819370031357 seconds it fetched: 25402 rows fetchall took 3625.72873998 seconds  EXPLAIN QUERY PLAN:  [(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.878498077393 seconds it fetched: 6761 rows fetchall took 1419.34246588 seconds inserting took 0.340960025787 seconds It took 1420.56637716 seconds 

So a bit faster than my previous way, but still not fast enough. Next I'm going to try web_bod's solution.


Update 3

Using web_bod's solution I got the following times:

EXPLAIN QUERY PLAN:  [(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.0521960258484 seconds it fetched: 13052 rows fetchall took 90.5810132027 seconds  EXPLAIN QUERY PLAN:  [(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum  USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')] query took: 0.278959989548 seconds it fetched: 25195 rows fetchall took 4310.6012361 seconds 

The third one sadly didn't finish because of a reboot. So this is a bit faster than my first solution, but slower than using R*Tree


Update 4

Working on a different query which was going incredibly slow I saw that it was going into an uninterrupted sleep (see this question). So I checked top while running this query and it's switching between R and D state, lowering the CPU usage from 100 to 50%. This could be why it's running so slow with all the solutions provided.


Update 5

I migrated to MySQL but I'm getting the same results.

like image 456
Niek de Klein Avatar asked May 10 '12 10:05

Niek de Klein


People also ask

How long does a SQLite query take?

Query one takes about 30ms, but 150ms to fetch the data from the database. Query two takes about 3ms -this is the one I therefore prefer-, but also takes 170ms to fetch the data.

What is Fetchall in sqlite3?

You can fetch data from MYSQL using the fetch() method provided by the sqlite python module. The sqlite3. Cursor class provides three methods namely fetchall(), fetchmany() and, fetchone() where, The fetchall() method retrieves all the rows in the result set of a query and returns them as list of tuples.


1 Answers

The execution time geometrically proportional to the number of rows in each table rather than arithmetically e.g.

3 tables with 10 rows each => 1,000 comparision  3 tables with 10, 10 and 40 rows => 4,000 comparisons  3 tables with 20 rows each => 8,000 comparisons 

You could probably re-factor the query to avoid some of the joins/cursors - when do you need an answer?

Could you do something like this:

SELECT precursor_id, feature_table_id  FROM MSMS_precursor INNER JOIN       (         SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id          FROM spectrum         INNER JOIN              (select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id             from feature             where feature.msrun_msrun_id = 'value'            ) subquery           ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id         WHERE              spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax      ) subquery      ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id   WHERE      MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax  

Using a subquery enables you to reduce the number of comparisons between the tables - you can quickly filter out the unwanted features, then the un-related spectra before searching for suitable precursors.

I don't use SQLLite - but the principle should still apply.

UPDATED : fixed bug in SQL

Notes:

You don't have to worry about the ANDs, you'll only get:

  • features where feature.msrun_msrun_id = 'value'
  • spectra for those features and where spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
  • precursors for those spectrs and where MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax

UPDATE 18/May:

It's the indexing!!! you have indexes on the search fields, but not on the fields participating in the joins - foreign key indices really boost performance:

CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC);  CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC);  CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC);  
like image 105
web_bod Avatar answered Oct 04 '22 13:10

web_bod