Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database schema confusing (Index and Constraints)

I have a little confusion about the designing of the schema, but before i start, let me show you the schema first,

CREATE TABLE Person
(
    PersonID INT NOT NULL PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    -- some columns here..
    CONSTRAINT tb_idF INDEX (FirstName),
    CONSTRAINT tb_idL INDEX (LastName)
    -- or 
    -- CONSTRAINT tb_idL INDEX (FirstName, LastName)
    -- other constraints ...
);

CREATE TABLE JobDescription
(
    JobDescriptionID INT NOT NULL PRIMARY KEY,
    JobDescriptionName VARCHAR(50) UNIQUE
    -- some columns here..
    -- constraints ...
);

and the confusion is here, the mapping table for tables: Person and JobDescription. Currently, I have this design,

CREATE TABLE Person_JobDescription
(
    RECID INT AUTO_INCREMENT PRIMARY KEY,   -- for some special reasons
                                            -- I need to map to other table
    PersonID INT,
    JobDescriptionID INT,
    StartYear INT,                          -- year JobDescription was Appointed
    EndYear INT,
    CONSTRAINT tb_fk1 FOREIGN KEY (PersonID) 
        REFERENCES Person(PersonID),
    CONSTRAINT tb_fk2 FOREIGN KEY (JobDescriptionID) 
        REFERENCES JobDescription(JobDescriptionID),
    CONSTRAINT tb_uq UNIQUE (PersonID, JobDescriptionID)
);

but i have this other idea which structure of the mapping table will be like this

CREATE TABLE Person_JobDescription
(
    PersonID INT,           -- map these two columns on the other table
    JobDescriptionID INT,   -- ^^
    StartYear INT,          -- year JobDescription was Appointed
    EndYear INT,
    CONSTRAINT tb_fk1 FOREIGN KEY (PersonID) 
        REFERENCES Person(PersonID),
    CONSTRAINT tb_fk2 FOREIGN KEY (JobDescriptionID) 
        REFERENCES JobDescription(JobDescriptionID),
    CONSTRAINT tb_pk PRIMARY KEY (PersonID, JobDescriptionID)
);

when I created and tested the query against the tables above, both of them returns the same results and the performance is also the same as I am testing on a small database (having 50k records). I wonder if how the two queries behave on large database.

QUESTIONS

  • Which of the two schemas of the mapping table (Person_JobDescription) will you prefer on large database?

As instructed, I am not allowed to create a UNIQUE constraint on FirstName and LastName. But i have supplied an index on the two columns.

  • What type of index will I use on table Person? An index for each column or a compound index for FirstName and LastName?
  • When will I use single index INDEX (Col1) and INDEX (Col2) over INDEX (Col1, Col2)?

Thank you for taking time to read on this question.

Best Regards,

Derek Floss

like image 975
SkyDrive Avatar asked Dec 15 '12 14:12

SkyDrive


1 Answers

I'd prefer the second approach. By using surrogate ID numbers when they're not logically necessary for identification, you introduce more mandatory joins. This requires you to "chase ID numbers all over the database", which is the SQL equivalent to "chasing pointers all over the database". Chasing pointers was characteristic of IMS, one of the database architectures the relational model intended to replace. (IMS uses a hierarchical architecture.) There's no point reinventing it today. (Although a lot of people do just that.)

If you have, for example, five levels of surrogate ID numbers, and you want a person's name, you have to do four joins to get it. Using the second approach, you just need one join. If you don't want to write multi-column joins, use CREATE VIEW and do it just once.

Performance is simple to test. Just generate a few million random-ish rows using your favorite scripting language, and load them into a test server. You'll not only find where your performance problems are hiding, you'll find all the errors in your CREATE TABLE code. (Your code won't work as-is.) Learn about EXPLAIN if you don't already know about it.

As for indexing, you can test that on the random-ish rows you generate and load. A multi-column index on (first_name, last_name) will work best if users always supply a first name. But a lot of users won't do that, preferring to search by last name instead. A multi-column index on (first_name, last_name) isn't effective for users who prefer to search by last name. You can test that.

For that reason alone, indexing of first names and last names is usually more effective if there are two separate indexes, one for the first name, and one for the last name.


What does chasing id numbers mean?

The unspoken design pattern underlying this question is "Every row must have an id number, and all foreign keys must reference the id number." In a SQL database, it's actually an anti-pattern. As a rule of thumb, any pattern that lets you design tables without thinking about keys should be presumed guilty until proven innocent--it should be presumed to be an anti-pattern until proven not to be.

create table A (
 a_id integer primary key,
 a_1 varchar(15) not null unique,
 a_2 varchar(15) not null
);

create table B (
  b_id integer primary key
  a_id integer not null references A (a_id),
  b_1  varchar(10) not null,
  unique (a_id, b_1),
);

create table C (
  c_id integer primary key,
  b_id integer not null references B (b_id),
  c_1 char(3) not null,
  c_2 varchar(20) not null,
  unique (b_id, c_1)
);

create table D (
  d_id integer primary key,
  c_id integer not null references C (c_id),
  d_1 integer not null,
  d_2 varchar(15),
  unique (c_id, d_1)
);

If you need a report on table "D", and the report needs

  • columns D.d_1 and D.d_2, and
  • columns A.a_1 and A.a_2,

you need 3 joins to get to it. (Try it.) You're chasing ID numbers. (Like chasing pointers in IMS.) The following structure is different.

create table A (
 a_1 varchar(15) primary key,
 a_2 varchar(15) not null
);

create table B (
  a_1 varchar(15) not null references A (a_1),
  b_1  varchar(10) not null,
  primary key (a_1, b_1),
);

create table C (
  a_1 varchar(15) not null,
  b_1 varchar(10) not null,
  c_1 char(3) not null,
  c_2 varchar(20) not null,
  primary key (a_1, b_1, c_1),
  foreign key (a_1, b_1) references B (a_1, b_1)
);

create table D (
  a_1 varchar(15) not null,
  b_1 varchar(10) not null,
  c_1 char(3) not null,
  d_1 integer not null,
  d_2 varchar(15),
  primary key (a_1, b_1, c_1, d_1),
  foreign key (a_1, b_1, c_1) references C (a_1, b_1, c_1)
);

With this structure, the same report needs a single join.

select D.d_1, D.d_2, A.a_1, A.a_2
from D
inner join A on D.a_1 = A.a_1;
like image 160
Mike Sherrill 'Cat Recall' Avatar answered Oct 20 '22 18:10

Mike Sherrill 'Cat Recall'