Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternative to COUNT for innodb to prevent table scan?

I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But, for the size of tables the query is slower than it should be (0.17s). The reason, based on the EXPLAIN provided below, is because there is a table scan on the meta_relationships table due to it having the COUNT in the WHERE clause on an innodb engine.

Query:

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND meta_relationships.object_id 
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
        GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id

This particular query, selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories that are selected, the higher the count would be.

Ideally, I'd like to get it functioning like this:

WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)

or

WHERE meta.meta_name IN ('computers','software') 
AND meta_relationships.meta_order IN (0,1)

etc..

But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.

I've tried...

WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)

but it doesn't return the correct amount of rows.

EXPLAIN:

id  select_type   table               type    possible_keys          key               key_len ref                                   rows   Extra   
1   PRIMARY       meta                ref     PRIMARY,idx_meta_name  idx_meta_name     602     const                                 1      Using where; Using index; Using temporary; Using filesort
1   PRIMARY       meta_data           ref     PRIMARY,idx_meta_id    idx_meta_id       8       database.meta.meta_id                 1  
1   PRIMARY       meta_relationships  ref     idx_meta_data_id       idx_meta_data_id  8       database.meta_data.meta_data_id       11     Using where
1   PRIMARY       posts               eq_ref  PRIMARY                PRIMARY           4       database.meta_relationships.object_id 1  
2   MATERIALIZED  meta_relationships  index   NULL                   idx_object_id     4       NULL                                  14679  Using index

Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)

  • The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
  • I believe the key to optimizing the query is to get away completely from doing the COUNT.
  • An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
  • The meta_relationships table will grow quickly and with the current size (~15K rows) I'm hoping to achieve an execution time in the 100th of seconds rather than the 10ths of seconds.
  • Since there needs to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
  • I have created an IDE with sample data.

How can I optimize this query?

EDIT :

I was never able to find an optimal solution to this problem. It was really a combination of smcjones recommendation of improving the indexes for which I would recommend doing an EXPLAIN and looking at EXPLAIN Output Format then change the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.

SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'

After getting rid of the COUNT, the big performance killer was the GROUP BY and ORDER BY, but the indexes are your best friend. I learned that when doing a GROUP BY, the WHERE clause is very important, the more specific you can get the better.

like image 530
EternalHour Avatar asked Apr 04 '15 07:04

EternalHour


4 Answers

With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.

I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.

Based on your example, I have created this SQLFiddle.

Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE index against meta.meta_naame.

From my testing, Fiddle #2 is faster.

Optimizing Your Query

This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):

SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
   FROM posts
   LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
   LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
   LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
   WHERE meta.meta_name = 'animals'
   GROUP BY meta_relationships.object_id
   HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);

HAVING sum() = min() on a GROUP BY should check to see if there is more than one record of each type. Obviously, each time the record shows up, it will add more to the sum. (Edit: On subsequent tests it seems like this has the same impact as count(meta_relationships.object_id) = 1. Oh well, the point is I believe you can remove subquery and have the same result).

I want to be clear that you won't notice much if any optimization on the query I provided you unless the section, WHERE meta.meta_name = 'animals' is querying against an index (preferably a unique index because I doubt you'll need more than one of these and it will prevent accidental duplication of data).

So, instead of a table that looks like this:

CREATE TABLE meta_data (
  meta_data_id BIGINT,
  meta_id BIGINT,
  type VARCHAR(50),
  description VARCHAR(200),
  parent BIGINT,
  count BIGINT);

You should make sure you add primary keys and indexes like this:

CREATE TABLE meta_data (
  meta_data_id BIGINT,
  meta_id BIGINT,
  type VARCHAR(50),
  description VARCHAR(200),
  parent BIGINT,
  count BIGINT,
  PRIMARY KEY (meta_data_id,meta_id),
  INDEX ix_meta_id (meta_id)
);

Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.

When indexes are not used, the MySQL will walk through each row of the table until it finds what you want. In such a limited example as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, this will become extraordinarily painful.

In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or using the WHERE syntax.

A note on the count column

I have not found putting count columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, It should be very easy to use count() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.

The problem with putting count columns into a table is that you need to update that count, using either a TRIGGER or, worse, application logic. As your program scales out that logic can either get lost or buried. Adding that column is a deviation from normalization and when something like this is to occur, there should be a VERY good reason.

Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.

like image 167
smcjones Avatar answered Oct 11 '22 00:10

smcjones


Since the HAVING seems to be the issue, can you instead create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field if the post was edited.

So, consider this:

Add a new field to the posts table called "num_meta_rel". It can be an unsigned tinyint as long as you'll never have more than 255 tags to any one post.

Update the field like this:

UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);

This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.

Now, you rewrite your query like this:

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id

If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx

Note that this solution requires that you update the num_meta_rel (choose a better name, that one is terrible...) if the post has a new tag associated with it. But that should be much faster than scanning your entire table over and over.

like image 27
hpf Avatar answered Oct 11 '22 01:10

hpf


See if this gives you the right answer, possibly faster:

SELECT  p.post_id, p.post_name,
        GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
        GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
    FROM  
      ( SELECT  object_id
            FROM  meta_relation
            GROUP BY  object_id
            HAVING  count(*) = 1 
      ) AS x
    JOIN  meta_relation AS mr ON mr.object_id = x.object_id
    JOIN  posts AS p ON p.post_id = mr.object_id
    JOIN  meta_data AS md ON mr.meta_data_id = md.meta_data_id
    JOIN  meta ON md.meta_id = meta.meta_id
    WHERE  meta.meta_name = ?
    GROUP BY  mr.object_id 
like image 35
Rick James Avatar answered Oct 11 '22 01:10

Rick James


Unfortunately I have no possibility to test performance,

But try my query using your real data:

http://sqlfiddle.com/#!9/81b29/13

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
  SELECT meta_relationships.object_id
   FROM meta_relationships 
   GROUP BY meta_relationships.object_id 
   HAVING count(*) < 3
  ) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
  SELECT * 
  FROM meta
  WHERE  meta.meta_name = 'health'
  ) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
like image 33
Alex Avatar answered Oct 11 '22 01:10

Alex