How to link many tables containing the same products with different columns and IDs

Tags: sql, mysql, database-design

I have a number of tables from different suppliers containing information about the products they supply. The formats of the tables differ (different IDs, different descriptions, etc.) because each supplier stores its own information about its products. Some products in different tables might be the same, but I don't know that at the time the rows are inserted; I have to go through them manually to determine which products are the same.

For example, consider the following tables (SQL Fiddle):

CREATE TABLE A (
  id char(10) PRIMARY KEY,
  name char(16),
  color char(16),
  weight float
);

CREATE TABLE B (
  id int(11) PRIMARY KEY,
  name varchar(60),
  color char(3)
);

Each table may have different ID values, columns, and even values for the same column; e.g.:

Table A:
+------------+-----------------+---------------+-----------------+
|         ID |            NAME |         COLOR |          WEIGHT |
+------------+-----------------+---------------+-----------------+
| RFY-55-001 |  Wagon, Classic |           Red | 15.199999809265 |
| RFY-62-001 | Trike, My First |           Red |   8.60000038147 |
| RFY-64-001 |  Trike, 12 Inch |           Red |  15.39999961853 |
| SWN-35-001 | Trike, Roadster | Metallic Blue | 20.700000762939 |
| SWN-35-002 | Trike, Roadster |        Silver | 20.700000762939 |
| SWN-35-003 | Trike, Roadster |    Cherry Red | 20.700000762939 |
+------------+-----------------+---------------+-----------------+

Table B:
+-------+--------------------------------------------+--------+
|    ID |                                       NAME |  COLOR |
+-------+--------------------------------------------+--------+
| 10560 |                  Schwinn Roadster Tricycle |    BLU |
| 10685 |              Radio Flyer Classic Red Wagon | (null) |
| 10880 | Radio Flyer Classic Red Dual Deck Tricycle | (null) |
| 12008 |         Fisher-Price I Can Play Basketball | (null) |
+-------+--------------------------------------------+--------+

Supplier A stores a short product name, but includes detailed product information in additional columns. Supplier B stores a more detailed product name, but not much additional information. In both cases, the manufacturer's identity is mixed into some other field.

The rows are inserted automatically and I can't check each product before it is inserted. I can only go over them once in a while and update the links I find manually. For example, a little research shows that the Radio Flyer Classic Red Dual Deck Tricycle has a 12" front wheel and weighs 15.4 pounds, indicating that item RFY-64-001 in Table A is the same product as item 10880 in Table B.

What relations would I use to "link" rows in the different tables if they refer to the same product, sometime after inserting them, so that I would have one key for each unique product?

The obvious solution would be to add a connecting table with my key and use triggers when inserting to the other tables. The problem with this solution is that the "sync" process becomes pretty manual, and prone to mistakes.

asked Dec 28 '14 20:12

Talor

1 Answer

Okay; so you have a number of tables originating from different sources, each implementing the same entity (Product) with their own logical structure. Any given record may exist in more than one table and your task is to reconcile these different data sources, using some manual process for determining which records are similar/identical across tables.

This sounds similar to a process that we go through at Cal/EPA to reconcile data from many different sources in creating a statewide inventory of air pollution. These sources include several different modeling teams, each working with a distinct set of input and output parameters; direct measurements from monitoring stations; annual reports from local and municipal agencies; and in some cases, data reported directly to us by individuals and businesses. This data comes to us in many different formats and much of it overlaps.

I won't bore you with the details of our reconciliation process, but I can tell you some of the things I've learned from it and suggest an approach that may work for you.

First, putting together overlapping data that exists in many different formats sucks. Our jobs would be much easier if everyone just used the same table structures and stayed out of each other's way. That said, this is the reality we have to deal with; this is why we have jobs. Different organizations and individuals operate differently, period, and most of the time you won't have the power to change the way they operate.

You need a plan, or else mistakes will be made—not occasional mistakes, but constant ones. One broad approach you can take is: I will accept data from all these sources, in all these formats and put it into a staging database, where I will manipulate and aggregate and chop and slap and maim the data as necessary to fit it into my shiny, well-behaved database where there's only one physical model and everything is wonderful. I can then forget about all the unspeakable things I did to the data to fit it in there in the first place, and life will be wonderful... until I have to do it all over again. This is called ETL, which stands for Extract, Transform, Load—or possibly Entrap, Torture, Lobotomize depending on how much fun you're having. You can read about it on Wikipedia if you like (or here on SO, or here on DBA.SE, or in this guide from a company that knows a thing or two about torture).

The nice thing about ETL is that you end up somewhere that feels good. The not-so-nice thing about ETL is that by transforming the data to fit your particular needs, you often lose some degree of detail or utility that was present in the source. On the flip side, if you create a highly generalized data model in order to capture as much of that source data as possible and avoid destructive transformations, you end up with a highly generalized data model, which implies more documentation and maintenance work even if your model is perfect—which it never is.

Now, it sounds like you're either unwilling or unable to transform your source data to fit a reasonably focused model. And you should not cram it all into one table full of repeating columns and NULLs, lest your rapid descent into madness be punctuated by visions of Cthulhu in his house at R'lyeh:

+-----+---------+------------+---------+-----------------+-------------+-----+
| id  | sup1_id | sup1_color | sup2_id | sup2_color      | sup2_weight | ... |
+-----+---------+------------+---------+-----------------+-------------+-----+
| 1   | 7124    | brn        | 93      | Burnt Sienna    | 0.65        | ... |
| 2   | 415     | yel        | 8552    | Bananas Foster  | 12.50       | ... |
| 3   | NULL    | NULL       | 51      | Mostly Red      | 2.00        | ... |
| 4   | 159     | wht        | NULL    | NULL            | NULL        | ... |
| 5   | NULL    | NULL       | NULL    | NULL            | NULL        | NULL NULLNULLNULLUNNLUNUL gratuitous Unicode diacritics, you get the idea

That doesn't belong in a relational database at all; it's a flat file/spreadsheet paradigm that should be restricted to non-database environments like Excel. (Or Access. Zing!)

What you're left with, if you need to retain the disparate tables in their various forms but still want to relate them to one another and keep some kind of "master" record, is creating additional tables to characterize those relationships and store your "one key to rule them all." Now we're heading in the direction of talking about subtypes and roles. You have one entity (Product) that's represented in various ways; if each instance of a product exists in exactly one representation, then you're dealing with entity subtypes. In the case where there is overlap, as there is here, it's better to think of roles that instances can take on with each supplier. Here's an example from Data Modeling, A Beginner's Guide:

[Figure not preserved: diagram from Data Modeling, A Beginner's Guide, showing an entity on the left linked to its roles on the right through a junction table that records dates.]

Your "master" ID and any associated information would belong to the relation on the left, which I'll just call Product. This is where you want to store attributes that will always be the same for a given product, no matter the supplier; or, the "official" version of an attribute that you know differs among suppliers, if such is required. For example, if you are a manufacturer, your MSRP for a product does not depend on the retailer carrying the product; it should be in the main table. It may also serve as an "official" price in contrast with the various list prices associated with different retailers.

Your various supplier tables are on the right. These are the roles that a product takes on; here you would store any information that can vary from supplier to supplier (in your example, product color) or that is associated with some suppliers and not others (in your example, weight). Since there is a many-to-many relationship between products and roles, you add a junction table between them, where roles are assigned. You don't need dates in your junction table, as in the above image, if you're not keeping a history, but they are an example of information about the role assignment that belongs in the junction table.
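
To make the shape of this concrete, here is a minimal sketch of what the master table and junction tables might look like for the question's tables A and B. All of the new names (product, product_a, product_b) are hypothetical. It also assumes one junction table per supplier table, which is my own simplification: A.id and B.id have different types, so a single junction table could not declare real foreign keys to both. It further assumes A and B use InnoDB, since foreign keys require it.

```sql
-- Hypothetical names throughout; tables A and B are the ones from the question.
CREATE TABLE product (
  id   INT AUTO_INCREMENT PRIMARY KEY,  -- the one "master" key
  name VARCHAR(100) NOT NULL            -- the "official" name you settle on
) ENGINE=InnoDB;

-- Junction table: assigns rows of supplier table A to master products.
CREATE TABLE product_a (
  a_id       CHAR(10) NOT NULL PRIMARY KEY,  -- each A row links to at most one product
  product_id INT      NOT NULL,
  FOREIGN KEY (a_id)       REFERENCES A (id),
  FOREIGN KEY (product_id) REFERENCES product (id)
) ENGINE=InnoDB;

-- Junction table: assigns rows of supplier table B to master products.
CREATE TABLE product_b (
  b_id       INT NOT NULL PRIMARY KEY,       -- each B row links to at most one product
  product_id INT NOT NULL,
  FOREIGN KEY (b_id)       REFERENCES B (id),
  FOREIGN KEY (product_id) REFERENCES product (id)
) ENGINE=InnoDB;
```

Making the supplier key the primary key of each junction table means a supplier row can be linked only once, while a product can still be linked to rows in any number of supplier tables. If you do want to keep a history of assignments, this is where the date columns would go.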

Since you specified that you need to link together records from the supplier tables after they're inserted, via your mystery manual process, your workflow under this approach would probably be something like:

1. Records are inserted into various supplier tables.
2. You notice that new records have been inserted (here's where a trigger could be useful).
3. Using your magical mystery manual process, you identify whether each new record does or does not already exist in Product.
4. Whenever a new record from the supplier does not exist in Product, you insert a new row into Product to uniquely identify (and perhaps characterize) that item.
5. You insert a new row into the junction table for each new record from a supplier, linking the unique record for that item to the additional information characterizing it in its role with that supplier.
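
For the tricycle from your question, the last two steps might look roughly like the following. This is only a sketch: it assumes a hypothetical master table product (id, name) with an auto-increment id, and hypothetical per-supplier junction tables product_a (a_id, product_id) and product_b (b_id, product_id). None of those names come from your schema.

```sql
-- Assumed (hypothetical) tables: product, product_a, product_b.
START TRANSACTION;

-- The product is not in the master table yet, so create it.
INSERT INTO product (name)
VALUES ('Radio Flyer Classic Red Dual Deck Tricycle');

-- Remember the master key that was just generated.
SET @pid = LAST_INSERT_ID();

-- Link both supplier rows to that one master key.
INSERT INTO product_a (a_id, product_id) VALUES ('RFY-64-001', @pid);
INSERT INTO product_b (b_id, product_id) VALUES (10880, @pid);

COMMIT;

-- Read the product back with whatever each supplier knows about it.
SELECT p.id, p.name, sa.id AS a_id, sa.weight, sb.id AS b_id, sb.name AS b_name
FROM product AS p
LEFT JOIN product_a AS pa ON pa.product_id = p.id
LEFT JOIN A AS sa ON sa.id = pa.a_id
LEFT JOIN product_b AS pb ON pb.product_id = p.id
LEFT JOIN B AS sb ON sb.id = pb.b_id
WHERE p.id = @pid;
```

Wrapping the inserts in a transaction keeps you from ending up with a master row that has no links if something fails halfway through.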

Note that because your foreign keys will be in the junction table, it's possible to have products with no role/supplier and to have suppliers that carry no products. As for keeping up with the workflow and avoiding mistakes, unless you're prepared to give us much more information about your process for linking products between supplier tables, the best I can suggest is to use AFTER INSERT triggers on each of the supplier tables. Each trigger puts the name of the table and the PK of the new row into a table that keeps track of new products until they've gone through your manual linking process; you then remove those rows as you link them.
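
A minimal sketch of that tracking table and its triggers, for the question's tables A and B, could look like this. The table name unlinked_product, its columns, and the trigger names are all hypothetical; the PK is stored as text so that both the char and int keys fit in one column.

```sql
-- Hypothetical tracking table: one row per supplier row awaiting manual linking.
CREATE TABLE unlinked_product (
  source_table VARCHAR(64) NOT NULL,   -- which supplier table the row came from
  source_pk    VARCHAR(64) NOT NULL,   -- that row's primary key, stored as text
  noticed_at   TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (source_table, source_pk)
);

-- Single-statement trigger bodies, so no DELIMITER change is needed.
CREATE TRIGGER a_after_insert AFTER INSERT ON A
FOR EACH ROW
  INSERT INTO unlinked_product (source_table, source_pk)
  VALUES ('A', NEW.id);

CREATE TRIGGER b_after_insert AFTER INSERT ON B
FOR EACH ROW
  INSERT INTO unlinked_product (source_table, source_pk)
  VALUES ('B', CAST(NEW.id AS CHAR));

-- Once a row has been through the manual linking process, clear it by hand.
DELETE FROM unlinked_product
WHERE source_table = 'A' AND source_pk = 'RFY-64-001';
```

Your work queue is then just a SELECT on unlinked_product, ordered by noticed_at if you want to handle the oldest rows first.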

You could use another trigger on the junction table to clear the new/unlinked products table, but it's riskier to have a trigger updating or deleting rows. In fact, if you can populate your new/unlinked products table using application logic instead of triggers, that might be preferable as well. You can read through the blog post Triggers Considered Harmful, Considered Harmful for a thoughtful analysis of both sides of the trigger debate, though it's not specific to MySQL.
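
If you'd rather avoid triggers entirely, one alternative is a plain anti-join: instead of maintaining a tracking table, ask the database which supplier rows have no junction row yet. To be clear, this is a different technique from the trigger approach above, and the sketch assumes a hypothetical junction table product_a (a_id, product_id) that records which rows of supplier table A have been linked.

```sql
-- Assumed (hypothetical) junction table: product_a (a_id, product_id).
-- Lists every row of supplier table A that has not been linked to a product.
SELECT sa.id, sa.name, sa.color
FROM A AS sa
LEFT JOIN product_a AS pa ON pa.a_id = sa.id
WHERE pa.a_id IS NULL;   -- no junction row exists, so the row is still unlinked
```

The trade-off is that you need one such query per supplier table, and "unlinked" then means "has no junction row" rather than "has not been reviewed", which may or may not match your process.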

answered Oct 09 '22 15:10

Air
