I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:
1) Create tables
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a basic few basic initial tables (e.g. object, attribute, object_attribute)-- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.
What issues have you encountered (experience or theoretical) that makes this design so bad?
2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.
What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to cleanup other tables -- easy to stay consistent.
4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.
I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!
You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".
One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.
Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
Third, how do you enforce that a given product has all of the attributes that it needs? In a row you can make column NOT NULL and they are then required to get that row into the database. You can't enforce that in the EAV model.
Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.
Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.
These are just a few of the problems which I've encountered over the years. I'm sure that there are others.
What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.
SELECT p.name product_name,
pa.value product_size
FROM product p
left outer join product_attribute pa on (p.product_id = pa.product_id)
left outer join attribute a on (pa.attribute_id = a.attribute_id and
a.name = 'size')
If you want to get the size and some other attribute like color, things get trickier
SELECT p.name product_name,
pa_size.value product_size
pa_color.value product_color
FROM product p
left outer join product_attribute pa_size on (p.product_id = pa_size.product_id)
left outer join product_attribute pa_color on (p.product_id = pa_size.product_id)
left outer join attribute a_size on (pa_size.attribute_id = a.attribute_id and
a_size.name = 'size')
left outer join attribute a_color on (pa_color.attribute_id = a.attribute_id and
a_color.name = 'color')
Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"
Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.
I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
No, we wouldn't. Here's why.
You started with this.
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)
Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)
And translating that to teams and players . . .
Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)
So for sample data we might have
Team Player Value
--
St. Louis Cardinals Boggs, Mitchell ?
St. Louis Cardinals Carpenter, Chris ?
St. Louis Cardinals Franklin, Ryan ?
St. Louis Cardinals Garcia, Jaime ?
What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.
Team Player Value
--
St. Louis Cardinals Boggs, Mitchell 23
St. Louis Cardinals Carpenter, Chris 15
St. Louis Cardinals Franklin, Ryan 19
St. Louis Cardinals Garcia, Jaime 14
Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With