We have two tables: OriginalDocument and ProcessedDocument. In the first one we put an original, not processed document. After it's validated and processed (converted to our XML format and parsed), it's put into ProcessedDocument table. Processed document can be valid or invalid. Which makes more sense: have two different tables for valid and invalid documents or just have one with 'Valid' column? Some of the columns (~5-7) are irrelevant for invalid document. Storing both invalid and valid documents would also make Document table filled with 'NULL' columns (if document is invalid, information like document number, receiver can be unknown). What else should we consider and weigh, when making this decision?
There are few ways to combine the two tables without a common column including Cross Join (Cartesian Product) and UNION. This is not a join but can be useful for merging tables in SQL.
In SQL can a single column in a table reference multiple tables - no this is not possible. A foreign key always references one target table (and one table only).
Joins are used to combine the rows from multiple tables using mutual columns.
The majority of databases you'll work with as a developer will have more than one table, and those tables will be connected together in various ways to form table relationships.
Do try to make distinction between logical and physical modelling.
Even if the difference between the two entities is only seven properties, they are logically a different thing in those seven items. At the same time they are a same thing in other properties.
The way to logically represent that is this have one-to-one-or-zero relationship between the two tables, and to have one table store all the common properties (superclass) and in the other (subclass) you would only store the ID from the superclass.
In terms of performance this is not so bad:
Depending on the processes you are modelling, the frequency of these queries and other things (such as security for both entities, ownership, difference in integrity rules) you might decide to store this information in one table in the database or in two (either can be much faster in border-line cases and two table solution can also be denormalized a bit; for example you could still store information in a main table about the type of the document to avoid the join if that kind of query is all you care).
Or maybe your implementation decisions might be driven by your choice of application framework and for that reason you might really prefer working with single table or the other way around (for example automatic creation of data entry forms in frameworks such as django-admin).
Whatever you do, realize the difference between the logical and physical design. In your logical design normalize everything - it will pay off. In physical implementation make different scenarios and - test, test, test with your own data. Never confuse the order of the two (logical-conceptual and physical-practical modelling).
Wow, so much bad advice and design myths in a single question it's hard to know where to start.
Is this a VLDB? are you talking 100's of TB, 100's of GB, 1-10 GB?
Is this an untra-high performance DB? Do you need to squeeze out microseconds?
Most advice tends to lean toward those extremes where you might break a few basic rules for the sake of performance.
An earlier poster said,
"Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table."
He was on the right track there. And for that matter, whether it's processed or unprocessed it's also a document. I strongly question the first table split.
He then says,
"Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit."
I have no idea what that advice is based upon. If your RDBMS supports indexes, more data will have a very marginal additional cost at certain sizes of your index because your b-tree gets one level deeper. If you take his statement at face value, you should limit your table to n rows each and keep making new ones because "more data in your table = slower queries." I have no idea why people persist in this notion. If you have queries that REQUIRE full table scans for one type or the other, let's talk partitioning, not a new table. It takes about an extra 10 milliseconds to find a row in a billion row table than it does in a million row table because an index will probably only be one blevel deeper between the two.
Another poster said,
"5-7 columns that do not apply to invalid documents NOT NULL so valid documents are required to have them. In my opinion, with that many columns empty for invalid documents, it justifies a different table."
I wish people would explain there reasons. HOW does it justify it? On what basis would you make that decision. Is 4 too many? Why not? But 5 is too many? Maybe he assumes you're using an ancient RDBMS with fixed field lengths. I can't tell. If you put the nullable columns at the end of the row, you'll pay no cost for them. In the middle, a few extra bytes. If that's a HUGE deal, if you're really scrambling to make this multi-TB table a wee smaller... we'll talk about vertical partitioning... not a whole new table. Since you'll be extending the length of n% of rows, you'll want to carefully choose your PCTFREE, or how ever your database does that. Other than that, there's little downside of the nullable columns.
So let's talk about all the downsides of three tables.
I'm going to assume your table looks like this;
A surrogate PK column with a unique index.
A candidate key column with a unique index.
a few foreign keys to 'lookup' tables.
Several data fields.
the 5-7 nullable columns that are filled if a document becomes invalid.
The first issue is that you'll have 3 PK's across all the tables to make sure that the key is unique... but there's no cross table object to guarantee uniqueness in all three combined. Unless you're painstaking in your approach to the code that moves data from one table to the next, you could have the same document twice or more. Once in each table. If you have a single table for Original, processed, and invalid, then there's no way you'll ever have that happen.
With three tables, all of your constraints are going to be validated over and over. When you do your insert into the Original table, the PK is validated, the AK is validate, the FKs are validated, the other columns are validate. Room is made in all of the indexes for these new enteries, and perhaps causing block splits. Now you process the file and delete the entry from the Original table, all of those indexes suffer deletes, leaving empty space behind. Your insert into the next table, suffers all of that cost of your first insert again. Your indexes are acted upon, maybe causing block splits, your PK, AK and FK's are all validated again. Lather rinse repeat for invalid table.
Now, what happens to your data model if you adopt this paradigm when you discover that the business needs a 4th state? You're going to add a 4th document table for those in the unsubmitted state, or sent state. After all, the new sent state has 5-7 columns unneeded by the other states.
And there are lots of queries which become hoorible to write and run with multiple tables, with a single table they are clear, concise and fast... size of a table will really only affect Full Table Scans, which we try to avoid for tables like these.
I've seen systems like these. One major operational query is, "Where is my document?"
You've got to search 3 tables to find its state. What most people do next is build a UNION ALL view of all three tables to facilitate the myriad of questions like that. IF the other poster thinks your queries slow down with other data in your table, see how they really slow down when you do a UNION ALL to accomplish the same thing. 1 index of blevel 3 as opposed to 3 indexes of blevel 2.
I work in a trading company. We execute trades with counterparties. For accounting and legal reasons our company is defined as several companies. Well call them Trading, Holding, JointVenture. Our counterparties we'll call. JonesCo, SmithBarely, GoldSax.
So if I consider that the internal companies have a unique set of columns and the counterparties have a unique set of columns. You'd say that proper normalization would force them into two tables. So let's do that.
INT_CO_T 1 Trading 2 Holding 3 JointVenture
CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax
Now I need a trade table where I map the transaction between our company(ies) and counterparties
TRADE_T (Int_co_T.ID, Ctr_pty_T.ID, other trade columns)
Great.
Whoops, Business says that the JointVenture will execute trades with Trading. BTW, This is a very common scenario, this happens all the time. Trading house would call these Book-to-Book trades.
Now I'm left with two choices. (Three really) but.
1 is that I could do something very silly and place JointVenture and Trading into the Counterparty table so that my mapping table will still work. This leads to nightmare queries which I'm sure those involved in this conversation will recognize. Or I can build a separate Mapping table.. and that too leads to some unions if I want to see all of the trades from a given company.
The third and better way is to build a single table for both counterparties and internal companies, called Trading_entities or something. Now I need one mapping table to show either internal or external trades. I can easily see net position and net exposure with one query, two tables. etc.
If you're really hung up on the nullable fields then vertically partition that table and use three tables. But the main table will have a list and most importantly a single key for either subtype of trading participant.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With