How to optimize a SELECT from several tables with millions of rows

I have the following tables (Oracle 10g):

catalog (
  id NUMBER PRIMARY KEY,
  name VARCHAR2(255),
  owner NUMBER,
  root NUMBER REFERENCES catalog(id)
  ...
)
university (
  id NUMBER PRIMARY KEY,
  ...
)
securitygroup (
  id NUMBER PRIMARY KEY
  ...
)
catalog_securitygroup (
  catalog REFERENCES catalog(id),
  securitygroup REFERENCES securitygroup(id)
)
catalog_university (
  catalog REFERENCES catalog(id),
  university REFERENCES university(id)
)

Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.

I need to select any 50 rows from catalog with a specified root, ordered by name, for the current university and current securitygroup. Here is the query:

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c, catalog_securitygroup cs, catalog_university cu
        WHERE c.root = 100
          AND cs.catalog = c.id
          AND cs.securitygroup = 200
          AND cu.catalog = c.id
          AND cu.university = 300
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

Here 100 is some catalog, 200 is some securitygroup, and 300 is some university. This query returns 50 rows out of ~170 000 in 3 minutes.

But the next query returns these rows in 2 seconds:

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
        WHERE c.root = 100
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

I have built the following indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).

Plan for first query (using PLSQL Developer):

http://habreffect.ru/66c/f25faa5f8/plan2.jpg

Plan for second query:

http://habreffect.ru/f91/86e780cc7/plan1.jpg
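(In case the plan images become unavailable: the plans can also be captured in plain text with EXPLAIN PLAN and DBMS_XPLAN, which are available on 10g. A minimal sketch for the first query:)

EXPLAIN PLAN FOR
SELECT c.id, c.name, c.owner
  FROM catalog c, catalog_securitygroup cs, catalog_university cu
 WHERE c.root = 100
   AND cs.catalog = c.id
   AND cs.securitygroup = 200
   AND cu.catalog = c.id
   AND cu.university = 300
 ORDER BY name;

-- display the plan that was just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);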

What are the ways to optimize the query I have?

asked Nov 17 '10 by Anton Schukin


2 Answers

The indexes that could be useful here relate to the conditions in

WHERE c.root = 100
      AND cs.catalog = c.id
      AND cs.securitygroup = 200
      AND cu.catalog = c.id
      AND cu.university = 300

So the following fields are candidates for indexes:

c: id, root   
cs: catalog, securitygroup   
cu: catalog, university

So, try creating

(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)

and

(catalog_university.catalog, catalog_university.university)
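For example, the corresponding DDL could look like this (a minimal sketch; the index names are made up):

CREATE INDEX ix_cs_catalog_sg ON catalog_securitygroup (catalog, securitygroup);

CREATE INDEX ix_cu_catalog_univ ON catalog_university (catalog, university);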

EDIT: I missed the ORDER BY; the sort column should also be considered, so

(catalog.name, catalog.id)

might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))
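For instance (again, just a sketch with a made-up index name):

CREATE INDEX ix_catalog_root_name_id ON catalog (root, name, id);

With an equality predicate on root, the rows can be read from this index already in name order, so the ORDER BY does not require a separate sort step.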

EDIT 2: Although another answer has been accepted, I'll provide some more food for thought. I have created some test data and run some benchmarks.

The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:

test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
 ?column? | ?column? | ?column? 
----------+----------+----------
   500000 |  1497501 |   500000
(1 row)

The database is PostgreSQL 8.4 on a default Ubuntu install; hardware is an i5 with 4 GB RAM.
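Roughly, test data of this shape can be generated with generate_series along these lines (a sketch only; the exact values and distributions here are arbitrary, and the FK parent tables and constraints are omitted for brevity):

CREATE TABLE catalog (
  id integer PRIMARY KEY,
  name varchar(255),
  owner integer,
  root integer
);
CREATE TABLE catalog_securitygroup (
  catalog integer,
  securitygroup integer,
  PRIMARY KEY (catalog, securitygroup)
);
CREATE TABLE catalog_university (
  catalog integer,
  university integer,
  PRIMARY KEY (catalog, university)
);

INSERT INTO catalog
SELECT g, md5(g::text), g % 1000, g % 1000
FROM generate_series(1, 500000) g;

INSERT INTO catalog_university
SELECT g, g % 500
FROM generate_series(1, 500000) g;

-- three security groups per catalog row gives 1 500 000 rows
INSERT INTO catalog_securitygroup
SELECT c, (c % 1000) * 3 + s
FROM generate_series(1, 500000) c, generate_series(0, 2) s;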

First, I rewrote the query to:

SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50 
  AND cs.catalog = c.id 
  AND cu.catalog = c.id
  AND cs.securitygroup < 200
  AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100

Note: the conditions are turned into less-than comparisons to maintain a comparable number of intermediate rows (the above query would return 198,801 rows without the LIMIT clause).

If run as above, without any extra indexes (save for PKs and foreign keys), it runs in 556 ms on a cold database (this is actually an indication that I oversimplified the sample data somehow; I would be happier if I had 2-4 s here without resorting to less-than operators).

This brings me to my point: any straight query that only joins and filters (a certain number of tables) and returns only a certain number of records should run under 1 s on any decent database, without the need to use cursors or to denormalize data (one of these days I'll have to write a post on that).

Furthermore, if a query returns only 50 rows and uses simple equality joins and restrictive equality conditions, it should run much faster still.

Now let's add some indexes. The biggest potential in queries like this is usually the sort order, so let me try that:

CREATE INDEX test1 ON catalog (name, id);

This brings the execution time of the query down to 22 ms on a cold database.

And that's the point: if you are trying to get only a page of data, you should only get a page of data, and execution times of queries such as this on normalized data with proper indexes should be less than 100 ms on decent hardware.

I hope I didn't oversimplify the case to the point where the comparison is meaningless (as I stated before, some simplification is present, since I don't know the cardinality of the relationships between catalog and the many-to-many tables).

So, the conclusion is:

  • If I were you, I would not stop tweaking the indexes (and the SQL) until the query runs in under 200 ms, as a rule of thumb.
  • Only if I found an objective explanation of why it cannot go below that value would I resort to denormalization and/or cursors, etc.
answered Oct 06 '22 by Unreason


First, I assume that your University and SecurityGroup tables are rather small. You posted the sizes of the large tables, but it's really the other sizes that are part of the problem.

Your problem stems from the fact that you can't join the smallest tables first. Your join order should go from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one or the other, joining to a big table, then to another big table, and only then, with that large intermediate result, do you get to a small table.

If you always have current_univ, current_secgrp and root as inputs, you want to use them to filter as soon as possible. The only way to do that is to change your schema a bit. In fact, you can leave the existing tables in place if you have to, but you'll be adding to the storage used with this suggestion.

You've normalized the data very well. That's great for update speed... not so great for querying. We denormalize to speed up querying (that's the whole reason for data warehouses; ok, that and history). Build a single mapping table with the following columns:

Univ_id, SecGrp_ID, Root, catalog_id. Make it an index-organized table with the first 3 columns as the PK.

Now, when you query that index with all three PK values, the index scan will give you the complete list of allowable catalog ids; from there it's just a single join to the catalog table to get the catalog item details, and you're off and running. A rough sketch follows.
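Something along these lines (the table and column names are just examples, not the answerer's exact design; catalog_id is added to the PK here so the combinations stay unique):

CREATE TABLE catalog_access_map (
  univ_id    NUMBER,
  secgrp_id  NUMBER,
  root       NUMBER,
  catalog_id NUMBER,
  CONSTRAINT pk_catalog_access_map
    PRIMARY KEY (univ_id, secgrp_id, root, catalog_id)
) ORGANIZATION INDEX;

-- With the current university, security group and root as inputs,
-- one range scan of the IOT yields the allowable catalog ids,
-- followed by a single join back to catalog:
SELECT c.id, c.name, c.owner
  FROM catalog_access_map m
  JOIN catalog c ON c.id = m.catalog_id
 WHERE m.univ_id   = 300
   AND m.secgrp_id = 200
   AND m.root      = 100
 ORDER BY c.name;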

answered Oct 06 '22 by Stephanie Page