
Faster way of retrieving aggregate data from large table?

Tags:

mysql

I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.

The indices on the table are on url and datetime.

I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:

SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;

The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?

asked by egidra

2 Answers

Tens of millions of rows per day is quite a lot.

Assuming:

  • only 10 million new records per day;
  • your table contains only the columns that you mention in your question;
  • url is of type TEXT with a mean (Punycode) length of ~77 characters;
  • pageviews is of type INT;
  • int_views is of type INT;
  • ext_views is of type INT; and
  • datetime is of type DATETIME

then each day's data will occupy around 9.9 × 10⁸ bytes, which is almost 1 GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
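
For reference, here's the back-of-envelope arithmetic behind that figure (it ignores per-row and index overhead, and assumes the pre-5.6 8-byte DATETIME storage format):

url        ~77 bytes (TEXT payload) + 2-byte length prefix ≈ 79 bytes
pageviews    4 bytes (INT)
int_views    4 bytes (INT)
ext_views    4 bytes (INT)
datetime     8 bytes (DATETIME)
           -----------------------------------------------------------
           ≈ 99 bytes/row × 10^7 rows/day ≈ 9.9 × 10^8 bytes/day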

MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comments) without partitioning, on Windows or Linux, a limit of a few GiB is not uncommon, which implies the table will reach its capacity well within a working week!

As @Gordon Linoff mentioned, you should partition your table. However, each table has a limit of 1024 partitions: with one partition per day (which would be eminently sensible in your case), you will be limited to storing a little under 3 years of data in a single table before the partitions start getting reused.

I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as @Ben explained, a composite index on (datetime, url) would help; I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query. And, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:

CREATE TABLE news_2012 (
  INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;

CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
  SET NEW.date = DATE(NEW.datetime);

CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
  SET NEW.date = DATE(NEW.datetime);
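
To illustrate with hypothetical values: once those triggers are in place, inserts don't need to mention the derived date column at all:

INSERT INTO news_2012 (url, pageviews, int_views, ext_views, datetime)
VALUES ('http://example.com/story', 123, 100, 23, '2012-08-29 14:00:00');

The BEFORE INSERT trigger sets date to DATE('2012-08-29 14:00:00') = '2012-08-29' before the row is stored, so the row lands in the partition selected by HASH(TO_DAYS('2012-08-29')).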

If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables. (An alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements, as UNION views are neither updatable nor insertable.)

DROP TABLE news;
CREATE TABLE news (
  date DATE,
  INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
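
Year-end maintenance on such a MERGE table is then a single statement; for example, after building a (hypothetical) news_2013 in the same way as news_2012 above, and assuming news_2011 also exists:

ALTER TABLE news UNION = (news_2013, news_2012, news_2011);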

You can then run your above query (along with any other) on this merge table:

SELECT   url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM     news
WHERE    date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT    10;
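
If you want to verify that MySQL is indeed pruning partitions, EXPLAIN PARTITIONS (available from MySQL 5.1) lists the partitions a query touches. Run it against one of the underlying year tables, since the MERGE table itself is not partitioned:

EXPLAIN PARTITIONS
SELECT   url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM     news_2012
WHERE    date = '2012-08-29'
GROUP BY url;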

answered by eggyal


A few points:

  1. As datetime is the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
  2. You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
  3. You're querying 23 hours of data, not 24. You probably want an explicit less-than, <, against midnight of the following day so you don't miss anything, as in the revised query below:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
  FROM news
 WHERE datetime >= '2012-08-29 00:00:00'
   AND datetime < '2012-08-30 00:00:00'
 GROUP BY url
 ORDER BY sum(pageviews) DESC
 LIMIT 10;

You could index this on (datetime, url, pageviews, int_views, ext_views) to make it a covering index, but I think that would be overkill; so, provided the index isn't too big, (datetime, url) seems like a good way to go. The only way to be certain is to test it and decide whether any improvement in query performance is worth the extra time taken in index maintenance.
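
For concreteness, a minimal sketch of that two-column index, assuming url is a TEXT column and therefore needs a prefix length (the index name and the prefix of 100 are arbitrary choices):

CREATE INDEX news_datetime_url ON news (datetime, url(100));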

As Gordon has just mentioned in the comments, you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all of your queries are based at the day level, it sounds like you might need to create a new partition each day.
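
A sketch of what that daily routine could look like, assuming the table is RANGE-partitioned on TO_DAYS(datetime) (the partition names here are made up, and repartitioning an existing table of this size would itself be a slow rebuild):

ALTER TABLE news PARTITION BY RANGE (TO_DAYS(datetime)) (
    PARTITION p20120829 VALUES LESS THAN (TO_DAYS('2012-08-30')),
    PARTITION p20120830 VALUES LESS THAN (TO_DAYS('2012-08-31'))
);

-- each day, append the next day's partition at the top of the range:
ALTER TABLE news ADD PARTITION (
    PARTITION p20120831 VALUES LESS THAN (TO_DAYS('2012-09-01'))
);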

answered by Ben