I have a number of tables from different suppliers containing information about the products they supply. The formats of the tables are different—different ids, different description, etc.—because each supplier stores separate information about their products. Some products in different tables might be the same but I don't know that at the time the rows are inserted; I have to go through them manually to determine which products are the same.
For example, consider the following tables (SQL Fiddle):
CREATE TABLE A (
id char(10)PRIMARY KEY,
name char(16),
color char(16),
weight float
);
CREATE TABLE B (
id int(11) PRIMARY KEY,
name varchar(60),
color char(3)
);
Each table may have different ID values, columns, and even values for the same column; e.g.:
Table A:
+------------+-----------------+---------------+-----------------+
| ID | NAME | COLOR | WEIGHT |
+------------+-----------------+---------------+-----------------+
| RFY-55-001 | Wagon, Classic | Red | 15.199999809265 |
| RFY-62-001 | Trike, My First | Red | 8.60000038147 |
| RFY-64-001 | Trike, 12 Inch | Red | 15.39999961853 |
| SWN-35-001 | Trike, Roadster | Metallic Blue | 20.700000762939 |
| SWN-35-002 | Trike, Roadster | Silver | 20.700000762939 |
| SWN-35-003 | Trike, Roadster | Cherry Red | 20.700000762939 |
+------------+-----------------+---------------+-----------------+
Table B:
+-------+--------------------------------------------+--------+
| ID | NAME | COLOR |
+-------+--------------------------------------------+--------+
| 10560 | Schwinn Roadster Tricycle | BLU |
| 10685 | Radio Flyer Classic Red Wagon | (null) |
| 10880 | Radio Flyer Classic Red Dual Deck Tricycle | (null) |
| 12008 | Fisher-Price I Can Play Basketball | (null) |
+-------+--------------------------------------------+--------+
Supplier A stores a short product name, but includes detailed product information in additional columns. Supplier B stores a more detailed product name, but not much additional information. In both cases, the manufacturer's identity is mixed into some other field.
The rows are inserted automatically and I can't check each product before it is inserted. I can only go over them once in a while and update the links I find manually. For example, a little research shows that the Radio Flyer Classic Red Dual Deck Tricycle has a 12" front wheel and weighs 15.4 pounds, indicating that item RFY-64-001
in Table A is the same product as item 10880
in Table B.
What relations would I use to "link" rows in the different tables if they refer to the same product, sometime after inserting them, so that I would have one key for each unique product?
The obvious solution would be to add a connecting table with my key and use triggers when inserting to the other tables. The problem with this solution is that the "sync" process becomes pretty manual, and prone to mistakes.
Yes, it is possible for a table to have more than one column which can uniquely identify a row of data. A column that can uniquely identify a record of data is known as a "Candidate Key" .
Using the “FROM Table1, Table2” Syntax One way to join two tables without a common column is to use an obsolete syntax for joining tables. With this syntax, we simply list the tables that we want to join in the FROM clause then use a WHERE clause to add joining conditions if necessary.
The join is done by the JOIN operator. In the FROM clause, the name of the first table ( product ) is followed by a JOIN keyword then by the name of the second table ( category ). This is then followed by the keyword ON and by the condition for joining the rows from the different tables.
To join two tables based on a column match without loosing any of the data from the left table, you would use a LEFT OUTER JOIN. Left outer joins are used when you want to get all the values from one table but only the records that match the left table from the right table.
Okay; so you have a number of tables originating from different sources, each implementing the same entity (Product) with their own logical structure. Any given record may exist in more than one table and your task is to reconcile these different data sources, using some manual process for determining which records are similar/identical across tables.
This sounds similar to a process that we go through at Cal/EPA to reconcile data from many different sources in creating a statewide inventory of air pollution. These sources include several different modeling teams, each working with a distinct set of input and output parameters; direct measurements from monitoring stations; annual reports from local and municipal agencies; and in some cases, data reported directly to us by individuals and businesses. This data comes to us in many different formats and much of it overlaps.
I won't bore you with the details of our reconciliation process, but I can tell you some of the things I've learned from it and suggest an approach that may work for you.
First, putting together overlapping data that exists in many different formats sucks. Our jobs would be much easier if everyone just used the same table structures and stayed out of each other's way. That said, this is the reality we have to deal with; this is why we have jobs. Different organizations and individuals operate differently, period, and most of the time you won't have the power to change the way they operate.
You need a plan, or else mistakes will be made—not occasional mistakes, but constant ones. One broad approach you can take is: I will accept data from all these sources, in all these formats and put it into a staging database, where I will manipulate and aggregate and chop and slap and maim the data as necessary to fit it into my shiny, well-behaved database where there's only one physical model and everything is wonderful. I can then forget about all the unspeakable things I did to the data to fit it in there in the first place, and life will be wonderful... until I have to do it all over again. This is called ETL, which stands for Extract, Transform, Load—or possibly Entrap, Torture, Lobotomize depending on how much fun you're having. You can read about it on Wikipedia if you like (or here on SO, or here on DBA.SE, or in this guide from a company that knows a thing or two about torture).
The nice thing about ETL is that you end up somewhere that feels good. The not-so-nice thing about ETL is that by transforming the data to fit your particular needs, you often lose some degree of detail or utility that was present in the source. On the flip side, if you create a highly generalized data model in order to capture as much of that source data as possible and avoid destructive transformations, you end up with a highly generalized data model, which implies more documentation and maintenance work even if your model is perfect—which it never is.
Now, it sounds like you're either unwilling or unable to transform your source data to fit a reasonably focused model. And you should not cram it all into one table full of repeating columns and NULLs, lest your rapid descent into madness be punctuated by visions of Cthulhu in his house at R'lyeh:
+-----+---------+------------+---------+-----------------+-------------+-----+
| id | sup1_id | sup1_color | sup2_id | sup2_color | sup2_weight | ... |
+-----+---------+------------+---------+-----------------+-------------+-----+
| 1 | 7124 | brn | 93 | Burnt Sienna | 0.65 | ... |
| 2 | 415 | yel | 8552 | Bananas Foster | 12.50 | ... |
| 3 | NULL | NULL | 51 | Mostly Red | 2.00 | ... |
| 4 | 159 | wht | NULL | NULL | NULL | ... |
| 5 | NULL | NULL | NULL | NULL | NULL | NULL NULLNULLNULLUNNLUNUL gratuitous Unicode diacritics, you get the idea
That doesn't belong in a relational database at all; it's a flat file/spreadsheet paradigm that should be restricted to non-database environments like Excel. (Or Access. Zing!)
What you're left with, if you need to retain the disparate tables in their various forms but still want to relate them to one another and keep some kind of "master" record, is creating additional tables to characterize those relationships and store your "one key to rule them all." Now we're heading in the direction of talking about subtypes and roles. You have one entity (Product) that's represented in various ways; if each instance of a product exists in exactly one representation, then you're dealing with entity subtypes. In the case where there is overlap, as there is here, it's better to think of roles that instances can take on with each supplier. Here's an example from Data Modeling, A Beginner's Guide:
Your "master" ID and any associated information would belong to the relation on the left, which I'll just call Product. This is where you want to store attributes that will always be the same for a given product, no matter the supplier; or, the "official" version of an attribute that you know differs among suppliers, if such is required. For example, if you are a manufacturer, your MSRP for a product does not depend on the retailer carrying the product; it should be in the main table. It may also serve as an "official" price in contrast with the various list prices associated with different retailers.
Your various supplier tables are on the right. These are the roles that a product takes on; here would be stored any information that you want to be able to vary from supplier to supplier (in your example, product color) or that is associated some suppliers and not others (in your example, weight). Since there is a many-to-many relationship between products and roles, you add a junction table between them, where roles are assigned; you don't need to have dates in your junction table, as in the above image, if you're not keeping a history, but that's an example where information concerning the role assignment would be appropriate to store in the junction table.
Since you specified that you need to link together records from the supplier tables after they're inserted, via your mystery manual process, your workflow under this approach would probably be something like:
Note that because your foreign keys will be in the junction table, it's possible to have products with no role/supplier and to have suppliers that carry no products. As for keeping up with the workflow and avoiding mistakes, unless you're prepared to give us much more information about your process for linking products between supplier tables, the best I can suggest is to use AFTER INSERT
triggers on each of the supplier tables to put the name of the table and the PK of the row into a table that keeps track of new products until they've gone through your manual linking process, and remove these rows as you go through your manual linking process.
You could use another trigger on the junction table to clear the new/unlinked products table but it's riskier to have a trigger updating or deleting rows. In fact, if you can populate your new/unlinked products table using application logic instead of triggers, that might be preferable as well. You can read through the blog Triggers Considered Harmful, Considered Harmful for a thoughtful analysis of both sides of the trigger debate, though it's not specific to MySQL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With