I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But, for the size of tables the query is slower than it should be (0.17s). The reason, based on the EXPLAIN
provided below, is because there is a table scan on the meta_relationships
table due to it having the COUNT
in the WHERE
clause on an innodb
engine.
Query:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND meta_relationships.object_id
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id
This particular query, selects posts which have ONLY the computers
category. The purpose of count > 1
is to exclude posts that contain computers/hardware
, computers/software
, etc. The more categories that are selected, the higher the count would be.
Ideally, I'd like to get it functioning like this:
WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)
or
WHERE meta.meta_name IN ('computers','software')
AND meta_relationships.meta_order IN (0,1)
etc..
But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order
= 2.
I've tried...
WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)
but it doesn't return the correct amount of rows.
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY meta ref PRIMARY,idx_meta_name idx_meta_name 602 const 1 Using where; Using index; Using temporary; Using filesort
1 PRIMARY meta_data ref PRIMARY,idx_meta_id idx_meta_id 8 database.meta.meta_id 1
1 PRIMARY meta_relationships ref idx_meta_data_id idx_meta_data_id 8 database.meta_data.meta_data_id 11 Using where
1 PRIMARY posts eq_ref PRIMARY PRIMARY 4 database.meta_relationships.object_id 1
2 MATERIALIZED meta_relationships index NULL idx_object_id 4 NULL 14679 Using index
Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id
), KEY idx_meta_name
(meta_name
)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id
), KEY idx_meta_id
(meta_id
)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id
), KEY idx_object_id
(object_id
), KEY idx_meta_data_id
(meta_data_id
)
COUNT
.COUNT
would possibly be using meta_relationships.meta_order
or meta_data.parent
instead.meta_relationships
table will grow quickly and with the current size (~15K rows) I'm hoping to achieve an execution time in the 100th of seconds rather than the 10ths of seconds.WHERE
clause for each category/tag, any answer optimized for a dynamic query is preferred.How can I optimize this query?
EDIT :
I was never able to find an optimal solution to this problem. It was really a combination of smcjones recommendation of improving the indexes for which I would recommend doing an EXPLAIN
and looking at EXPLAIN Output Format then change the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.
SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'
After getting rid of the COUNT
, the big performance killer was the GROUP BY
and ORDER BY
, but the indexes are your best friend. I learned that when doing a GROUP BY
, the WHERE
clause is very important, the more specific you can get the better.
With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.
I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.
Based on your example, I have created this SQLFiddle.
Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE
index against meta.meta_naame
.
From my testing, Fiddle #2 is faster.
This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):
SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
FROM posts
LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'animals'
GROUP BY meta_relationships.object_id
HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);
HAVING sum() = min()
on a GROUP BY
should check to see if there is more than one record of each type. Obviously, each time the record shows up, it will add more to the sum. (Edit: On subsequent tests it seems like this has the same impact as count(meta_relationships.object_id) = 1
. Oh well, the point is I believe you can remove subquery and have the same result).
I want to be clear that you won't notice much if any optimization on the query I provided you unless the section, WHERE meta.meta_name = 'animals'
is querying against an index (preferably a unique index because I doubt you'll need more than one of these and it will prevent accidental duplication of data).
So, instead of a table that looks like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT);
You should make sure you add primary keys and indexes like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT,
PRIMARY KEY (meta_data_id,meta_id),
INDEX ix_meta_id (meta_id)
);
Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.
When indexes are not used, the MySQL will walk through each row of the table until it finds what you want. In such a limited example as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, this will become extraordinarily painful.
In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or using the WHERE
syntax.
count
columnI have not found putting count
columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, It should be very easy to use count()
and get the current count. If you want to have it in a table, you can use a VIEW
, although that will not be the most efficient way to make the pull.
The problem with putting count
columns into a table is that you need to update that count, using either a TRIGGER
or, worse, application logic. As your program scales out that logic can either get lost or buried. Adding that column is a deviation from normalization and when something like this is to occur, there should be a VERY good reason.
Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.
Since the HAVING seems to be the issue, can you instead create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field if the post was edited.
So, consider this:
Add a new field to the posts table called "num_meta_rel". It can be an unsigned tinyint as long as you'll never have more than 255 tags to any one post.
Update the field like this:
UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);
This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.
Now, you rewrite your query like this:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id
If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx
Note that this solution requires that you update the num_meta_rel (choose a better name, that one is terrible...) if the post has a new tag associated with it. But that should be much faster than scanning your entire table over and over.
See if this gives you the right answer, possibly faster:
SELECT p.post_id, p.post_name,
GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
FROM
( SELECT object_id
FROM meta_relation
GROUP BY object_id
HAVING count(*) = 1
) AS x
JOIN meta_relation AS mr ON mr.object_id = x.object_id
JOIN posts AS p ON p.post_id = mr.object_id
JOIN meta_data AS md ON mr.meta_data_id = md.meta_data_id
JOIN meta ON md.meta_id = meta.meta_id
WHERE meta.meta_name = ?
GROUP BY mr.object_id
Unfortunately I have no possibility to test performance,
But try my query using your real data:
http://sqlfiddle.com/#!9/81b29/13
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING count(*) < 3
) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
SELECT *
FROM meta
WHERE meta.meta_name = 'health'
) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With