Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inner Join vs Natural Join vs USING clause: are there any advantages?

Tags:

sql

join

Imagine I have two simple tables, such as:

CREATE TABLE departments(dept INT PRIMARY KEY, name);
CREATE TABLE employees(id PRIMARY KEY, fname, gname,
    dept INT REFERENCES departments(dept));

(simplified, of course).

I could have any of the following statements:

SELECT * FROM employees e INNER JOIN departments d ON e.dept=d.dept;
SELECT * FROM employees e NATURAL JOIN departments d;
SELECT * FROM employees e JOIN departments d USING(dept);

A working example can be found here: SQL Fiddle: http://sqlfiddle.com/#!15/864a5/13/10

They all give nearly the same results — certainly the same rows.

I have always preferred the first form because of its flexibility, readability and predictability — you clearly define what is connected to what.

Now, apart from the fact that the first form has a duplicated column, is there a real advantage to the other two forms? Or are they just syntactic sugar?

I can see the disadvantage in the latter forms is that you are expected to have named your primary and foreign keys the same, which is not always practical.

like image 340
Manngo Avatar asked Jan 27 '16 08:01

Manngo


1 Answers

Now, apart from the fact that the first form has a duplicated column, is there a real advantage to the other two forms? Or are they just syntactic sugar?

TL;DR NATURAL JOIN is used in a certain style of relational programming that is simpler than the usual SQL style. (Although when embedded in SQL it is burdened with the rest of SQL query syntax.) That's because 1. it directly uses the simple operators of predicate logic, the language of precision in engineering (including software engineering), science (including computer science) and mathematics, and moreover 2. simultaneously and alternatively it directly uses the simple operators of relational algebra.

The common complaint about NATURAL JOIN is that since shared columns aren't explicit, after a schema change inappropriate column pairing may occur. And that may be the case in a particular development environment. But in that case there was a requirement that only certain columns be joined and NATURAL JOIN without PROJECT was not appropriate. So these arguments assume that NATURAL JOIN is being used inappropriately. Moreover the arguers aren't even aware that they are ignoring requirements. Such complaints are specious. (Moreover, sound software engineering design principles lead to not having interfaces with such specificiatons.)

Another related misconceived specious complaint from the same camp is that "NATURAL JOIN does not even take foreign key relationships into account". But any join is there because of the table meanings, not the constraints. Constraints are not needed to query. If a constraint is added then a query remains correct. If a constraint is dropped then a query relying on it becomes wrong and must be changed to a phrasing that doesn't rely on it that wouldn't have had to change. This has nothing to do with NATURAL JOIN.


You have described the difference in effect: just one copy of each common column is returned.

From Is there any rule of thumb to construct SQL query from a human-readable description?:

It turns out that natural language expressions and logical expressions and relational algebra expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way.

Eg from Codd 1970:

The relation depicted is called component. [...] The meaning of component(x, y,z) is that part x is an immediate component (or subassembly) of part y, and z units of part x are needed to assemble one unit of part y.

From this answer:

Every base table has a statement template, aka predicate, parameterized by column names, by which we put a row in or leave it out.

Plugging a row into a predicate gives a statement aka proposition. The rows that make a true proposition go in a table and the rows that make a false proposition stay out. (So a table states the proposition of each present row and states NOT the proposition of each absent row.)

But every table expression value has a predicate per its expression. The relational model is designed so that if tables T and U hold rows where T(...) and U(...) (respectively) then:

  • T NATURAL JOIN U holds rows where T(...) AND U(...)
  • T WHEREcondition holds rows where T(...) AND condition
  • T UNION CORRESPONDING U holds rows where T(...) OR U(...)
  • T EXCEPT CORRESPONDING U holds rows where T(...) AND NOT U(...)
  • SELECT DISTINCTcolumns to keepFROM T holds rows where
    THERE EXISTS columns to drop SUCH THAT T(...)
  • etc

Whereas reasoning about SQL otherwise is... not "natural":

An SQL SELECT statement can be thought of algebraically as 1. implicitly RENAMEing each column C of a table with (possibly implicit) correlation name T to T.C, then 2. CROSS JOINing, then 3. RESTRICTing per INNER ON, then 4. RESTRICTing per WHERE, then 5. PROJECTing per SELECT, then 6. RENAMEing per SELECT, dropping T.s, then 7. implicitly RENAMEing to drop remaining T.s Between the T.-RENAMEings algebra operators can also be thought of as logic operators and table names as their predicates: T JOIN ... vs Employee T.EMPLOYEE has name T.NAME ... AND .... But conceptually inside a SELECT statement is a double-RENAME-inducing CROSS JOIN table with T.Cs for column names while outside tables have Cs for column names.

Alternatively an SQL SELECT statement can be thought of logically as 1. introducing FORSOME T IN E around the entire statement per correlation name T and base name or subquery E, then 2. referring to the value of quantified T by using T.C to refer to its C part, then 3. building result rows from T.Cs per FROM etc, then 4. naming the result row columns per the SELECT clause, then 4. leaving the scope of the FORSOMEs. Again the algebra operators are being thought of as logic operators and table names as their predicates. Again though, this conceptually has T.C inside SELECTs but C outside with correlation names coming and going.

These two SQL interpretations are nowhere near as straightforward as just using JOIN or AND, etc, interchangeably. (You don't have to agree that it's simpler, but that perception is why NATURAL JOIN and UNION/EXCEPT CORRESPONDING are there.) (Arguments criticizing this style outside the context of its intended use are specious.)

USING is a kind of middle ground orphan with one foot in the NATURAL JOIN camp and one in the CROSS JOIN. It has no real role in the former because there are no duplicate column names there. In the latter it is more or less just abbreviating JOIN conditions and SELECT clauses.

I can see the disadvantage in the latter forms is that you are expected to have named your primary and foreign keys the same, which is not always practical.

PKs (primary keys), FKs (foreign keys) & other constraints are not needed for querying. (Knowing a column is a function of others allows scalar subqueries, but you can always phrase without.) Moreover any two tables can be meaningfully joined. If you need two columns to have the same name with NATURAL JOIN you rename via SELECT AS.

like image 186
philipxy Avatar answered Sep 19 '22 00:09

philipxy