Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When should I use a composite primary key?

ETA: My question is based upon keeping an optimal database. What is the difference in database performance/size between having an all composite primary key for ProjectUserBooleanAttribute, which would presumably have indexes for PUAT_Id and UserID and a non-composite table using an auto-increment PK but then having indexes for both PUAT_Id and UserID? From further reading it seems if I went with the non-composite approach I would have to create a unique index on the those two columns. Would I still need to create indexes on those two columns? If so, doesn't this essentially means every column in that table would have it's own index?

Is this the quintessential dilemma of database size (indexes) vs performance?


So I have the following entities I wish to create

  • Projects
  • Users
  • ProjectUserAttributeTypes

In this simplified example, all of my ProjectUserAttributeTypes are going to be boolean, so I'm only showing the ProjectUserBooleanAttribute table.

Let's say I want to create two boolean ProjectUserAttributeType called Silver and Gold. I would simply create two rows in ProjectUserAttributeTypes. Now if I want to assign a user as having that attribute, I would add a row to the ProjectUserBooleanAttribute

A DBA has warned me off from using composite primary keys in general for performance reasons. However in this instance I fail to see what I gain by NOT using composites. In both cases I will need to make sure that ProjectUserBooleanAttribute has non null and unique values for all columns. I will surely need indexes as well.

NOTE: My ultimate objective is to be able to query my database and find all users that have certain attributes combinations. I would join the tables to filter by project, then use where clauses to filter even further. A few examples:

  • (GOLD OR SILVER)
  • (GOLD XOR SILVER)
  • ((GOLD OR SILVER) AND NOT (BRONZE))

COMPOSITE PK

Compsite ERD

VERSUS NON-COMPOSITE PK

No Composite ERD

like image 868
WhiskerBiscuit Avatar asked Jul 11 '15 19:07

WhiskerBiscuit


2 Answers

There are two main designs for relational databases:

  • natural keys
  • IDs

With natural keys you use the keys given: A project is identified by its project number, a user by their login name or number, etc. This often leads to compound (or composite) keys:

  • project (project_no, name, ...)
  • user (user_name, first_name, last_mname, ...)
  • project_user (project_no, user_name, role, ...)

The table project_user has a compound key: the project number plus the user name uniquely identifying a record, telling us who is working on which project.

With IDs, you usually add a technical ID that is only used to link records and has no meaning to a user:

  • project (project_id, project_no, name, ...)
  • user (user_id, user_name, first_name, last_mname, ...)
  • project_user (project_user_id, project_id, user_id, role, ...)

The tables contain the same fields plus the IDs and you need the same unique and not null constraints as with natural keys plus the constraints on the IDs.

The project_user_id in project_user is only necessary of course when there is any table needing that reference. But often every table gets an ID whether needed or not, just to make them all look alike (and so the IDs are already there in case they are needed later).

At first glance it seems that a database based on IDs is only more work, more indexes and nothing gained, but this is not the case. An ID concept is often chosen, because it gives more freedom. An example: What would happen, if a project number could change? With the natural keys, the project number is in many tables and would have to get updated cascadingly somehow, which can become quite a task. In an ID database, you would just change the project number in its one place.

And what would happen, if suddenly project numbers were only unique within a company? In an ID based database, you would add a company_id to the projects table, add a unique index on company_id and project_no and be done with it. With natural keys a company number (an ILN? an artificial number?) would have to be added to the primary key and would have to be introduced in all child tables. So: When you design a database with natural keys, you must think it all through to get to stable natural keys - and sometimes there are none, and then you have to invent some. With IDs you don't care that much whether fields can change or not. An ID based database is thus easier to implement.

Then there is hierarchy. Lets say there are several companies in your database, each with its own items, own warehouses.

Natural keys:

  • company (company_code, name, ...)
  • item (company_code, item_no, name, ...)
  • warehouse (company_code, warehouse_no, address, ...)
  • stock (company_code, warehouse_no, item_no, amount, ...)

IDs:

  • company (company_id, name, ...)
  • item (item_id, item_no, name, company_id, ...)
  • warehouse (warehouse_id, address, company_id, ...)
  • stock (stock_id, warehouse_id, item_id, amount, ...)

With the ID concept, you don't have to name the company_id again in the stock table, because its known from a parent table. To store it there would even be redundant, whereas in a natural key concept it is needed, because it's part of the compound key and without it we would lose the link to its parent tables. This purity is seen by some people as a great advantage of the ID concept over natural keys. However, there comes a disadvantage with it. In the natural key database it is guaranteed that a company's items are in the company's warehouses, as the company is part of the key of the stock table. With an ID concept the linked warehouse record could belong to company 1 and the linked item record to company 2. Inconsistent data caused by some wrong insert statement which the DBMS could not prevent us from. With the natural keys such an error can not occur.

And if I want to know how many stock a company has, I simply select from stock with natural keys. But I would have to select from stock plus another table to get the company in an ID database.

With much hierarchy you may get queries with many, many more tables involved when the database is ID based. So far I haven't seen an ID based database outperform one with natural keys. But I have seen natural key based databases outperform ID based ones by far. This may be because I have mainly seen big databases with much hierarchy.

As to your database: It seems to be ID based, provided project ID and user ID are only technical internal numbers - otherwise your database would be a mixed concept (natural project number, natural user ID, technical ID for ProjectUserBooleanAttribute). So your question has not really to do with compound keys or not.

PUAT_ID and UserID will both have to be in ProjectUserBooleanAttribute, they would be not null, and you should have a unique constraint (a unique index) on them. So they have all qualities a primary key needs, no matter whether you call this "primary key" or not. Whether you add a technical ID is just for the looks of it. It doesn't really change anything. The concept stays the same.

In a natural key concept you would make the fields the primary key. But then, you would not have a PUAT_Id, but some composite key here (ProjectId plus AttributeType?).

In a technical ID concept you would not make this the primary key, but make the fields not null and add a unique constraint (which makes it the key, only it is not called "primary"). Then either add a technical ID as the primary key or have the table without ID, and thus without a primary key. It doesn't matter. If someone asks for a key, give them the ID, if not, you can go without it. It is superfluous, as long as not needed by any other table.

like image 91
Thorsten Kettner Avatar answered Oct 11 '22 17:10

Thorsten Kettner


When you add an id column to a table it unavoidably adds overhead to managing that table. But the benefit is that other tables can now reference rows of that table by a single id column instead of the old composite key. This can make storage and indexes smaller and access faster for those tables. Also it can make references to the corresponding entities briefer (only one column) and more apparent (by the kind of id).

Note that in tables that now use that added id as FK, if you keep any of the old composite FK columns along with the id then you should have a constraint that the values for those columns are the same as the values for those columns in the row that the id references.

like image 20
philipxy Avatar answered Oct 11 '22 18:10

philipxy