
Diamond schema: how (de)normalized is that?

Let's suppose we have the following entities:

  • Production Studio
  • Journalist
  • Camera Operator
  • News Footage

In this simple world, a production studio has many journalists and many camera operators. Each journalist belongs to exactly one studio, and so does each camera operator. A piece of news footage is produced by one journalist and one camera operator, both of whom come from the same studio.

Here's my naive approach to putting this model into a relational database:

CREATE TABLE production_studios(
  id                   SERIAL PRIMARY KEY,
  title                TEXT NOT NULL
);

CREATE TABLE journalists(
  id                   SERIAL PRIMARY KEY,
  name                 TEXT NOT NULL,
  production_studio_id INTEGER NOT NULL REFERENCES production_studios
);

CREATE TABLE camera_operators(
  id                   SERIAL PRIMARY KEY,
  name                 TEXT NOT NULL,
  production_studio_id INTEGER NOT NULL REFERENCES production_studios
);

CREATE TABLE news_footages(
  id                   SERIAL PRIMARY KEY,
  description          TEXT NOT NULL,
  journalist_id        INTEGER NOT NULL REFERENCES journalists,
  camera_operator_id   INTEGER NOT NULL REFERENCES camera_operators
);

This schema forms a nicely shaped diamond ERD and raises a few questions.

The problem is that a piece of news footage can link a journalist with a camera operator who come from different production studios. I understand that this can be cured by writing the corresponding constraints, but for the sake of experiment let's pretend that we're doing an exercise in normal-form database design.
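To make the anomaly concrete, here is a minimal sketch against the tables above. It assumes freshly created tables, so that the SERIAL ids start at 1; the studio and person names are made up for illustration.

```sql
-- Two studios; on fresh tables they get ids 1 and 2.
INSERT INTO production_studios (title) VALUES ('Studio A'), ('Studio B');

-- Positional columns: (id, name, studio id).
-- Alice works for Studio A, Bob works for Studio B.
INSERT INTO journalists      VALUES (DEFAULT, 'Alice', 1);
INSERT INTO camera_operators VALUES (DEFAULT, 'Bob',   2);

-- Accepted by the schema, even though journalist 1 and
-- camera operator 1 belong to different studios.
INSERT INTO news_footages (description, journalist_id, camera_operator_id)
VALUES ('Mixed-studio report', 1, 1);
```

Every foreign key is satisfied on its own, so nothing in the schema rejects the last row.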

  1. The first question is about terminology: is it correct to say that this schema is denormalized? If so, which normal form does it break? Or is there a better name for this anomaly, such as inter-record redundancy or multipath relationships?

  2. How can this schema be changed to make the described anomaly impossible?

And of course I'd very much appreciate references to papers addressing this specific issue.

Serge Balyuk asked Feb 23 '12 18:02



1 Answer

The naive way would be to make your journalists and camera_operators dependent entities, dependent upon the studio for which they work. That means the production studio foreign key becomes part of their primary key. Your news_footage table then has a primary key consisting of 4 components:

  • production_studio_id
  • journalist_id
  • camera_operator_id
  • footage_id

and two foreign keys:

  • journalist_id, production_studio_id, pointing to the journalist table, and
  • camera_operator_id, production_studio_id, pointing to the camera operator table

Easy.
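As a rough sketch, that dependent-entity design might look like the following. It reuses the table names from the question and assumes PostgreSQL syntax; the column order inside the keys is one possible choice, not the only one.

```sql
create table production_studios (
  id    serial primary key,
  title text not null
);

-- The studio id is part of each person's primary key.
create table journalists (
  production_studio_id integer not null references production_studios,
  id                   integer not null,
  name                 text    not null,
  primary key (production_studio_id, id)
);

create table camera_operators (
  production_studio_id integer not null references production_studios,
  id                   integer not null,
  name                 text    not null,
  primary key (production_studio_id, id)
);

-- Both composite foreign keys share the same production_studio_id
-- column, so the journalist and the operator must belong to one studio.
create table news_footages (
  production_studio_id integer not null,
  journalist_id        integer not null,
  camera_operator_id   integer not null,
  id                   integer not null,
  description          text    not null,
  primary key (production_studio_id, journalist_id, camera_operator_id, id),
  foreign key (production_studio_id, journalist_id)
    references journalists (production_studio_id, id),
  foreign key (production_studio_id, camera_operator_id)
    references camera_operators (production_studio_id, id)
);
```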

Or not. Now you have defined in your E-R model the notion that the very existence of a camera operator or a journalist is dependent upon the studio for which they work. This does not reflect the real world very well: in this model, people can't change their employer.

Let's not do that.

In your original model, you are confusing a person with a role they play (journalist or camera operator), and you're missing a somewhat transient entity that is actually responsible for the production of your news footage: the [studio-specific] production team.

My E-R model would look something like this:

create table studio
(
  id    int          not null primary key ,
  title varchar(200) not null
);

create table person
(
  id    int          not null primary key ,
  title varchar(200) not null
);

-- a studio-specific pairing of a journalist and a camera operator
create table team
(
  studio_id          int not null ,
  journalist_id      int not null ,
  camera_operator_id int not null ,

  primary key ( studio_id , journalist_id , camera_operator_id ) ,

  foreign key ( studio_id          ) references studio ( id ) ,
  foreign key ( journalist_id      ) references person ( id ) ,
  foreign key ( camera_operator_id ) references person ( id )
);

-- each piece of footage belongs to exactly one team
create table footage
(
  studio_id          int not null ,
  journalist_id      int not null ,
  camera_operator_id int not null ,
  id                 int not null ,
  description        varchar(200) not null ,

  primary key ( studio_id , journalist_id , camera_operator_id , id ) ,

  foreign key     ( studio_id , journalist_id , camera_operator_id )
  references team ( studio_id , journalist_id , camera_operator_id )
);

Now you have a world in which people can work in different roles: the same person might be a camera operator in some contexts and a journalist in others. People can change employers. Studio-specific teams are formed, each consisting of a journalist and a camera operator. In some contexts, the same person might play both roles on a team. And, finally, a piece of news footage is produced by one and only one studio-specific team.

This reflects the real world much better, and it is much more flexible.
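A small usage sketch may help show what the composite foreign key buys. The ids, names, and descriptions below are invented for illustration, and the person and studio rows are inserted positionally as (id, title).

```sql
insert into studio values ( 1 , 'Studio A' );
insert into studio values ( 2 , 'Studio B' );
insert into person values ( 10 , 'Alice' );
insert into person values ( 20 , 'Bob' );

-- A team ties one journalist and one camera operator to one studio.
insert into team ( studio_id , journalist_id , camera_operator_id )
values ( 1 , 10 , 20 );

-- Accepted: the team ( 1 , 10 , 20 ) exists.
insert into footage ( studio_id , journalist_id , camera_operator_id , id , description )
values ( 1 , 10 , 20 , 1 , 'Morning report' );

-- Rejected with a foreign key violation: there is no team ( 2 , 10 , 20 ).
insert into footage ( studio_id , journalist_id , camera_operator_id , id , description )
values ( 2 , 10 , 20 , 1 , 'Evening report' );
```

Because footage can only point at an existing team row, and a team row carries exactly one studio, footage can never mix people across studios.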

Edited to add sample query:

To find the journalists working for a particular studio:

select p.*
from studio s
join team   t on t.studio_id = s.id
join person p on p.id        = t.journalist_id
where s.title = 'my desired studio name'

This would give you the set of people who are (or have been) associated with a studio in the role of journalist. One should note, though, that in the real world people work for employers for a period of time: to model it properly you need a start/end date, and you need to qualify the query with a relative notion of now.
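One possible way to sketch that time dimension is a separate table recording who works for which studio and when. The employment table and its columns are hypothetical, not part of the model above, and current_date is the standard SQL spelling of "today", which some engines name differently.

```sql
-- Hypothetical table: one row per period a person works for a studio.
create table employment
(
  person_id  int  not null ,
  studio_id  int  not null ,
  start_date date not null ,
  end_date   date ,            -- null means the employment is still open

  primary key ( person_id , studio_id , start_date ) ,

  foreign key ( person_id ) references person ( id ) ,
  foreign key ( studio_id ) references studio ( id )
);

-- People employed by a particular studio as of today.
select p.*
from studio     s
join employment e on e.studio_id = s.id
join person     p on p.id        = e.person_id
where s.title = 'my desired studio name'
  and e.start_date <= current_date
  and ( e.end_date is null or e.end_date > current_date );
```

This sketch does not by itself force team members to be current employees of the team's studio; enforcing that would need further constraints.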

Nicholas Carey answered Oct 13 '22 20:10
