Database design for getting unread article in mysql (over 100M row table)

This information is very condensed.

There are 2 tables.

article

-----------------------------------
|id | weight | text               |
-----------------------------------
|1  | 10     | blah               |
|2  | 100    | blah               |
|3  | 50     | blah               |
|4  | 1000   | blah               |
-----------------------------------

read

-----------------------------------
| user_id | article_id            |
-----------------------------------
| 1       | 4                     |
| 1       | 2                     |
| 1       | 3                     |
| 2       | 3                     |
| 2       | 4                     |
-----------------------------------

I want to get unread articles using the query below (very condensed):

SELECT 
    a.* 
FROM 
    article a LEFT OUTER JOIN `read` r ON r.article_id = a.id AND r.user_id = 1
WHERE
    r.article_id IS NULL
ORDER BY
    a.weight DESC
LIMIT 10

Important information:

  1. the number of rows in the read table is kept under 1000 per user (old data is removed; see the sketch after this list)
  2. the weight column in the article table changes frequently (so the ordering is never fixed)
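
A minimal sketch of that purge, assuming the read table has an auto-increment id column to order by (not shown in the condensed schema above); the LIMIT of 1000 matches item 1:

DELETE FROM `read`
WHERE user_id = 1
  AND id NOT IN (
      -- Wrap in a derived table so MySQL allows deleting from the
      -- same table that the subquery reads (avoids error 1093).
      SELECT id FROM (
          SELECT id
          FROM `read`
          WHERE user_id = 1
          ORDER BY id DESC      -- newest rows first (assumed insertion order)
          LIMIT 1000            -- keep only the most recent 1000 reads
      ) AS keep_rows
  );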

The problem (when the number of users is over 1M):

  1. unread articles are found through the read table (whether NOT IN or an outer join is used does not matter; an equivalent form is shown below)
    • the number of rows in the read table will exceed 1G (one billion)

It works well so far (the read table currently has about 100M rows), but I have to prepare for the next step because the number of users is increasing rapidly.
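
For reference, an equivalent NOT EXISTS form of the same per-user query (a sketch only; it reads the same rows and has the same scaling problem as the outer join):

SELECT
    a.*
FROM
    article a
WHERE
    NOT EXISTS (
        SELECT 1
        FROM `read` r
        WHERE r.article_id = a.id
          AND r.user_id = 1
    )
ORDER BY
    a.weight DESC
LIMIT 10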

What is the best approach for a large-scale service in this case?

(Sharding? Partitioning the table? Or redesigning the architecture?)

Thanks in advance

asked Jul 22 '15 by kimwz.kr

1 Answer

  • Add a column to article. It will be a flag saying whether the article is read or unread. (Do not make it a user count or a timestamp; that would slow down the subsequent steps.)
  • Whenever a user reads an article, check the flag and change it if needed.
  • Have INDEX(flag, weight, id) -- this will let your query run almost instantly, and it should be fine on the million-row article table. (A sketch follows this list.)
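
A minimal sketch of that schema change and the resulting query (column and index names are illustrative, not from the original answer):

ALTER TABLE article
    ADD COLUMN flag TINYINT NOT NULL DEFAULT 0,        -- 0 = unread, 1 = read
    ADD INDEX idx_flag_weight_id (flag, weight, id);

-- The unread query no longer needs to touch the read table:
SELECT a.*
FROM article a
WHERE a.flag = 0
ORDER BY a.weight DESC
LIMIT 10;

-- When a user reads an article, flip the flag only if it is still 0:
UPDATE article SET flag = 1 WHERE id = ? AND flag = 0;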

A problem: since you are purging (after 1000 rows per user), some "read" articles can become "unread" again. To deal with this, batch the purging and gather the distinct list of articles that were purged. Then do the tedious task of re-computing the flag, but only for those articles. INDEX(article_id) on the read table will help; use EXISTS ( SELECT * FROM `read` WHERE article_id = $aid ). (This can probably be turned into a batch operation rather than handling one article id at a time; see the sketch below.)
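
One possible shape for that batch re-computation (a sketch; the purged_articles table holding the purged ids is an assumption, not part of the original answer):

-- Assumption: the batch purge collected the distinct purged ids into
-- a temporary table purged_articles(article_id).
UPDATE article a
SET a.flag = EXISTS (SELECT * FROM `read` r WHERE r.article_id = a.id)
WHERE a.id IN (SELECT article_id FROM purged_articles);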

Another problem: secondary keys on billion-row tables are costly -- they can lead to a lot of I/O. Before attempting to address this, please provide SHOW CREATE TABLE for both tables, plus any other common SELECTs. Picking the right index(es) and datatypes is very important for performance on billion-row tables.
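
That is, the output of (the backticks are needed because READ is a reserved word in MySQL):

SHOW CREATE TABLE article;
SHOW CREATE TABLE `read`;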

answered Oct 13 '22 by Rick James