For example, we have table A, and table B which have a many-to-many relationship. An intersection table, Table C stores A.id and B.id along with a value that represents a relationship between the two. Or as a concrete example, imagine stackexchange which has a user account, a forum, and a karma score. Or, a student, a course, and a grade. If table A and B are very large, table C can and probably will grow monstrously large very quickly(in fact lets just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?
There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.
There are good reasons for using junction table versus comma-separated list (or similar), including:
When designing a junction table, ask the following questions:
In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR2(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
Considerations:
ORGANIZATION INDEX
: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.COMPRESS
: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.JUNCTION_TABLE_IE1
, EXTRA_DATA
: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.
Here is the equivalent MS SQL Server syntax (sans index compression):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.
1 In other words, do you only need to get "children" of given "parent", or you might also need to get parents of given child.
2 Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.
3 This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With