
In MySQL, how to JOIN two very large tables which both have columns in the WHERE condition?

I'm trying to determine the best general approach for querying two large joined tables when each table has a column in the WHERE clause. Imagine a simple schema with two tables:

posts
 id (int)
 blog_id (int)
 published_date (datetime)
 title (varchar)
 body (text)

posts_tags 
 post_id (int)
 tag_id (int)

With the following indexes:

posts: [blog_id, published_date]
posts_tags: [tag_id, post_id]
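
For reference, the schema and indexes above could be written out roughly as follows. This is only a sketch: the primary key on posts.id, the NOT NULL constraints, the VARCHAR length, the index names, and the InnoDB engine are assumptions, since the question does not state them.

```sql
-- Sketch of the schema described in the question.
-- Assumed (not stated in the question): PRIMARY KEY on posts.id,
-- NOT NULL constraints, VARCHAR(255), index names, ENGINE=InnoDB.
CREATE TABLE posts (
  id             INT NOT NULL AUTO_INCREMENT,
  blog_id        INT NOT NULL,
  published_date DATETIME NOT NULL,
  title          VARCHAR(255) NOT NULL,
  body           TEXT,
  PRIMARY KEY (id),
  KEY idx_blog_published (blog_id, published_date)  -- posts: [blog_id, published_date]
) ENGINE=InnoDB;

CREATE TABLE posts_tags (
  post_id INT NOT NULL,
  tag_id  INT NOT NULL,
  KEY idx_tag_post (tag_id, post_id)                -- [tag_id, post_id]
) ENGINE=InnoDB;
```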

We want to SELECT the 10 most recent posts on a given blog that were tagged with "foo". For the sake of this discussion, assume the blog has 10 million posts, and 1 million of those have been tagged with "foo". What is the most efficient way to query for this data?

The naive approach would be to do this:

 SELECT
  p.id, p.blog_id, p.published_date, p.title, p.body
 FROM
  posts p
 INNER JOIN
  posts_tags pt
  ON pt.post_id = p.id
 WHERE
  p.blog_id = 1
  AND pt.tag_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10;

MySQL will use our indexes, but will still end up scanning millions of records. Is there a more efficient way to retrieve this data without denormalizing the schema?
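
One way to see how much work MySQL expects to do for this query is to prefix it with EXPLAIN. The statement below is a sketch that reuses the table and column names from the question; the actual plan depends on the data distribution and the MySQL version, so no output is shown here.

```sql
-- Ask MySQL for the execution plan of the query from the question.
-- In the result, the "key" column shows which index each table uses,
-- "rows" is the optimizer's estimate of rows examined per table, and
-- "Extra" reports things such as "Using temporary" or "Using filesort".
EXPLAIN
SELECT p.id, p.blog_id, p.published_date, p.title, p.body
FROM posts p
INNER JOIN posts_tags pt
  ON pt.post_id = p.id
WHERE p.blog_id = 1
  AND pt.tag_id = 1
ORDER BY p.published_date DESC
LIMIT 10;
```

Large values in the rows column for either table would be consistent with the "scanning millions of records" concern described above.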

Newt asked Sep 07 '10 21:09





1 Answer

Any filter on a joined table can go in the join's ON clause, so that the WHERE clause contains only conditions on the primary table or conditions that involve more than one table. This will not speed up every query (for an INNER JOIN, MySQL's optimizer generally treats a condition in ON the same as one in WHERE), but it keeps each filter next to the table it applies to.

 SELECT
  p.id, p.blog_id, p.published_date, p.title, p.body
 FROM
  posts p
 INNER JOIN
  posts_tags pt
  ON pt.post_id = p.id
  AND pt.tag_id = 1
 WHERE
  p.blog_id = 1
 ORDER BY
  p.published_date DESC
 LIMIT 10;
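
Whether moving the tag filter into the ON clause changes anything can be checked rather than assumed. A minimal sketch, using the tables from the question: run EXPLAIN on this form and on the original form with both filters in WHERE, then compare the key, rows, and Extra columns. Identical plans would mean the rewrite is purely stylistic for this query.

```sql
-- Plan for the variant with the tag filter in the ON clause.
-- Compare its output with EXPLAIN for the WHERE-only version of the query.
EXPLAIN
SELECT p.id, p.blog_id, p.published_date, p.title, p.body
FROM posts p
INNER JOIN posts_tags pt
  ON pt.post_id = p.id
  AND pt.tag_id = 1
WHERE p.blog_id = 1
ORDER BY p.published_date DESC
LIMIT 10;
```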
Brent Baisley answered Sep 17 '22 16:09
