 

Implementing a Flexible Relationship in an RDBMS -- What really are the tradeoffs?

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:

1) Create tables

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
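To make option 1 concrete, here is a minimal sketch of that schema using Python's sqlite3 (table and column names are illustrative, not a recommendation):

```python
import sqlite3

# A sketch of the option-1 (EAV) schema in SQLite; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE attributes (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE product_attributes (
    product_id   INTEGER NOT NULL REFERENCES products(id),
    attribute_id INTEGER NOT NULL REFERENCES attributes(id),
    value        TEXT,  -- every value is a string, whatever its real type
    PRIMARY KEY (product_id, attribute_id)
);
""")

conn.execute("INSERT INTO products VALUES (1, 'Product A')")
conn.execute("INSERT INTO attributes VALUES (1, 'color')")
conn.execute("INSERT INTO product_attributes VALUES (1, 1, 'blue')")

row = conn.execute("""
    SELECT p.name, a.name, pa.value
      FROM products p
           JOIN product_attributes pa ON pa.product_id = p.id
           JOIN attributes a          ON a.id = pa.attribute_id
""").fetchone()
# row is ('Product A', 'color', 'blue')
```

Note that even reading back one attribute of one product already takes two joins.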

Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a few basic initial tables (e.g. object, attribute, object_attribute) -- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.

What issues have you encountered (from experience or in theory) that make this design so bad?

2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
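A sketch of option 2 as one wide table, again in SQLite (the column names are invented); most columns are NULL for any given product:

```python
import sqlite3

# Option 2: one wide products table with a typed, mostly-NULL column
# per attribute (column names are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE products (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    size     TEXT,      -- NULL wherever the attribute does not apply
    color    TEXT,
    shape    TEXT,
    weight   REAL,
    calories INTEGER,
    sugar    REAL
)""")

conn.execute("INSERT INTO products (id, name, color) VALUES (1, 'Product A', 'blue')")
row = conn.execute("SELECT name, color, calories FROM products").fetchone()
# ('Product A', 'blue', None) -- the food columns stay NULL for this product
```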

3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.

What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to cleanup other tables -- easy to stay consistent.
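A minimal sketch of option 3, storing the blob in a plain TEXT column and assuming no JSON operators in the database itself, shows both sides of the tradeoff:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, attrs TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'Product A', ?)",
             (json.dumps({"color": "blue", "size": "M"}),))
conn.execute("INSERT INTO products VALUES (2, 'Product B', ?)",
             (json.dumps({"calories": 120, "sugar": 9}),))

# Reading one product is easy and fast: every attribute arrives in one row.
name, attrs_json = conn.execute(
    "SELECT name, attrs FROM products WHERE id = 1").fetchone()
attrs = json.loads(attrs_json)  # {'color': 'blue', 'size': 'M'}

# But searching by attribute means deserializing every row in application
# code, unless your RDBMS offers JSON operators and expression indexes.
blue_products = [n for n, a in conn.execute("SELECT name, attrs FROM products")
                 if json.loads(a).get("color") == "blue"]
# blue_products == ['Product A']
```

Some databases (e.g. PostgreSQL's jsonb, SQLite's JSON functions) can push that filter into the SQL, which softens but does not remove the querying problem.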

4) I have read some things about being able to index strongly typed XML formats in some RDBMSs, but I honestly don't know much about this approach.

I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!

Asked by Davis Dimitriov, Jun 24 '11


3 Answers

You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".

One of the issues with this approach is that you end up mixing metadata with actual data. Your "attribute" row now has to tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.

Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight", what's to stop someone from putting "22 inches" in the value? Or a completely non-numeric value? You might say, "Well, my application will handle that." Then you need to change your application every time you want to add a new attribute, because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
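A small SQLite sketch of the difference (schema invented): a real typed column can carry a CHECK constraint the database enforces itself, while an EAV value column accepts anything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# With a real column, the database itself enforces type and range.
conn.execute("""
CREATE TABLE products (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    weight_kg REAL CHECK (weight_kg > 0)
)""")
try:
    conn.execute("INSERT INTO products VALUES (1, 'Widget', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the CHECK constraint fires

# In an EAV table, value is just text; nothing stops '22 inches' as a weight.
conn.execute("""
CREATE TABLE product_attributes (
    product_id INTEGER, attribute TEXT, value TEXT)""")
conn.execute(
    "INSERT INTO product_attributes VALUES (1, 'weight', '22 inches')")
stored = conn.execute("SELECT value FROM product_attributes").fetchone()[0]
# rejected is True; stored is '22 inches'
```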

Third, how do you enforce that a given product has all of the attributes that it needs? In a normal table you can make a column NOT NULL, so a value is required before the row can get into the database. You can't enforce that in the EAV model.

Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
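Answering that "heaviest product" question against an EAV table means remembering both attributes and converting units by hand inside the query (tables and values invented, run through Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_attributes (product_id INTEGER, attribute TEXT, value TEXT);
INSERT INTO product_attributes VALUES
  (1, 'Weight(kg)',  '10'),
  (2, 'Weight(lbs)', '15');
""")

# The query has to know about every duplicate attribute and normalize units.
heaviest = conn.execute("""
    SELECT product_id,
           CASE attribute
                WHEN 'Weight(kg)' THEN CAST(value AS REAL) * 2.20462
                ELSE                   CAST(value AS REAL)
           END AS weight_lbs
      FROM product_attributes
     WHERE attribute IN ('Weight(kg)', 'Weight(lbs)')
     ORDER BY weight_lbs DESC
     LIMIT 1
""").fetchone()
# product 1 (10 kg, about 22 lbs) outweighs product 2 (15 lbs)
```

Miss one of the duplicate attributes and the report is silently wrong.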

Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.

Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.

These are just a few of the problems which I've encountered over the years. I'm sure that there are others.

What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
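A subtype (class-table-inheritance) sketch in SQLite, with invented categories: shared columns live on the supertype, and each category gets its own table of typed columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared attributes live on the supertype...
CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL CHECK (kind IN ('apparel', 'food'))
);
-- ...and each category gets its own table of typed columns.
CREATE TABLE apparel_products (
    product_id INTEGER PRIMARY KEY REFERENCES products(id),
    size  TEXT,
    color TEXT
);
CREATE TABLE food_products (
    product_id INTEGER PRIMARY KEY REFERENCES products(id),
    calories INTEGER,
    sugar_g  REAL
);
""")

conn.execute("INSERT INTO products VALUES (1, 'Product A', 'apparel')")
conn.execute("INSERT INTO apparel_products VALUES (1, 'M', 'blue')")
row = conn.execute("""
    SELECT p.name, ap.size, ap.color
      FROM products p
           JOIN apparel_products ap ON ap.product_id = p.id
""").fetchone()
# ('Product A', 'M', 'blue')
```

This keeps the columns typed and constrainable while still handling products whose attribute sets differ by category.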

Answered by Tom H, Oct 24 '22


There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.

SELECT p.name   product_name,
       pa.value product_size
  FROM product p
         left outer join attribute a
           on (a.name = 'size')
         left outer join product_attribute pa
           on (pa.product_id   = p.product_id and
               pa.attribute_id = a.attribute_id)

If you want to get the size and some other attribute like color, things get trickier:

SELECT p.name         product_name,
       pa_size.value  product_size,
       pa_color.value product_color
  FROM product p
         left outer join attribute a_size
           on (a_size.name = 'size')
         left outer join attribute a_color
           on (a_color.name = 'color')
         left outer join product_attribute pa_size
           on (pa_size.product_id   = p.product_id and
               pa_size.attribute_id = a_size.attribute_id)
         left outer join product_attribute pa_color
           on (pa_color.product_id   = p.product_id and
               pa_color.attribute_id = a_color.attribute_id)

Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
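Here is what that "blue and medium" search looks like against a toy EAV schema, run through Python's sqlite3 so it can actually execute (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_attribute (product_id INTEGER, attribute_id INTEGER, value TEXT);
INSERT INTO product VALUES (1, 'Shirt A'), (2, 'Shirt B');
INSERT INTO attribute VALUES (1, 'color'), (2, 'size');
INSERT INTO product_attribute VALUES
  (1, 1, 'blue'), (1, 2, 'medium'),
  (2, 1, 'red'),  (2, 2, 'medium');
""")

# Every attribute in the search predicate costs two more joins.
rows = conn.execute("""
SELECT p.name
  FROM product p
       join product_attribute pa_color on pa_color.product_id = p.product_id
       join attribute a_color          on a_color.attribute_id = pa_color.attribute_id
                                      and a_color.name = 'color'
       join product_attribute pa_size  on pa_size.product_id = p.product_id
       join attribute a_size           on a_size.attribute_id = pa_size.attribute_id
                                      and a_size.name = 'size'
 WHERE pa_color.value = 'blue'
   AND pa_size.value  = 'medium'
""").fetchall()
# rows == [('Shirt A',)]
```

Two search criteria already mean four joins; in a conventional design this is a single-table query with two WHERE conditions.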

This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"

Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.

  • No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.
  • Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (e.g. if a product's size is Medium then the weight must be between 1 and 10 pounds; if a product's height is specified then a width is required as well), so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket of data that downstream processes decide they can't use because it isn't sufficiently complete.
  • You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
    In the worst case, the response to this problem during configuration is to decide that you need a flexible way to specify business rules and workflows, so that the people configuring the application can quickly code their business rules when they add new attributes, and so that they can control the flow of the application by grouping attributes together or skipping certain pages (e.g. have a page where make & model are required if the product type is car, and skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.
Answered by Justin Cave, Oct 24 '22


I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.

No, we wouldn't. Here's why.

You started with this.

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)

Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)

And translating that to teams and players . . .

Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)

So for sample data we might have

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    ?
St. Louis Cardinals    Carpenter, Chris   ?
St. Louis Cardinals    Franklin, Ryan     ?
St. Louis Cardinals    Garcia, Jaime      ?

What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    23
St. Louis Cardinals    Carpenter, Chris   15
St. Louis Cardinals    Franklin, Ryan     19
St. Louis Cardinals    Garcia, Jaime      14

Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
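The conventional fix is one typed column per fact, so each number carries its meaning; a SQLite sketch (the batting average shown is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team_players (
    team_name       TEXT NOT NULL,
    player_name     TEXT NOT NULL,
    games_played    INTEGER,          -- each fact is its own typed column
    batting_average REAL,
    PRIMARY KEY (team_name, player_name)
);
INSERT INTO team_players VALUES
  ('St. Louis Cardinals', 'Boggs, Mitchell', 23, 0.167);
""")

row = conn.execute(
    "SELECT games_played, batting_average FROM team_players").fetchone()
# (23, 0.167) -- both facts are stored, and each one is unambiguous
```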

Answered by Mike Sherrill 'Cat Recall', Oct 24 '22