i have this table (500,000 row)
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL,
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
i'm thinking to make each query by the publishedon between x and y
(show in all the site just records of 1 month)
in the same time, i want to add with the publishedon
in the where clause published, cat_id , source_id
some thing like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
that query is ok and fast until now without indexing, but when trying to use order by publishedon
its became too slow, so i used this index
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`
)
it worked and the order by publishedon
became fast, now i want to order by views
like this
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
this is the explanation
this query is too slow because of ORDER BY views DESC
then i'm tried to drop the old index and add this
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`,
`views` DESC
)
its too slow also
what about if i use just single index on publishedon
?
what about using single index on cat_id,source_id,views,publishedon?
i can change the query dependencies like publishedon in one month if i found other indexing method depend on any other columns
what about making index in (cat_id
, source_id
, publishedon
, published
) ? but in some cases i will use source_id only?
what is the best indexing schema for that table
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs.
MySQL has three types of indexes: INDEX, UNIQUE (which requires each row to have a unique value), and PRIMARY KEY (which is just a particular UNIQUE index).
Indexing makes columns faster to query by creating pointers to where data is stored within a database. Imagine you want to find a piece of information that is within a large database. To get this information out of the database the computer will look through every row until it finds it.
One important general note as to why your query isn't getting any faster despite your attempts is that DESC
on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.
In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three different buckets. The absolutely, the maybe, and the never. You certainly don't have anything in your indexing that's in the never column, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN
as @Machavity recommends in his answer, you may find that by the time these columns are used to strip out fields, there aren't that many fields anyway. An example of a column that would solidly be in this pile for me would be the published
column. Keep in mind that every INDEX
adds to the work your queries need to do.
Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon
, [action]edby_id
, and [action]ed
could easily be put into another table, actions
:
+-----------+-------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| action_id | int(11) | NO | | NULL | |
| action | varchar(45) | NO | | NULL | |
| date | datetime | NO | | CURRENT_TIMESTAMP | |
| user_id | int(11) | NO | | NULL | |
+-----------+-------------+------+-----+-------------------+----------------+
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER
. The upside is that when you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only be created
, but also by all of your other actions.
Edit: As requested, here is a sample sorting query
SELECT * FROM listings
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published')
AND (listings.published = 1)
AND (listings.cat_id in(1,2,3,4,5))
AND (listings.source_id in(1,2,3,4,5))
AND (actions.actiondate between 1441105258 AND 1443614458)
ORDER BY listings.views DESC
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite key on actiondate
and listings.id
, this should help to increase speed.
As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
I used this nifty tool to confirm what I thought I understood of your question: You are sorting by month here. Your example is specifically looking between September 1st and September 30th, inclusive.
So another option is for you to split your integer function into a month
, day
, and year
field. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN
on even a simple query and you'll see for yourself.
That way, you can just index the month and year fields and do a query like this:
SELECT * FROM listings
WHERE (publishedmonth = 9)
AND (publishedyear = 2015)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Slap an EXPLAIN
in front and you should see massive improvements.
Because you're planning on referring to a month and a day, you may want to add a composite key against month and year, rather than a key on both separately, for added gains.
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need, and, frankly, will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
This query:
SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
(published = 1) AND
(cat_id in (1,2,3,4,5)) AND
(source_id in (1,2,3,4,5));
Is hard to optimize with only indexes. The best index is one that starts with published
and then has the other columns -- it is not clear what their order should be. The reason is because all but published
are not using =
.
Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE
clause before the index can be used for the ORDER BY
. That makes this hard to optimize.
Suggestions . . . None are that great:
ORDER BY
faster, but won't help the ORDER BY
.published
in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.WHERE
clause or to return a smaller set of data.published
and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN
and so on) in the outer query. The subquery will use the index for sorting and then filter the results.The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With