User Defined Fields (UDFs) are custom fields that can be set up when there is data you'd like to track that doesn't already have a field available in the software.
If performance is the primary concern, I would go with #6: a table per UDF (really, this is a variant of #2). This answer is tailored specifically to the data distribution and access patterns you describe.
Because you indicate that some UDFs have values for only a small portion of the overall data set, a separate table would give you the best performance: that table will be only as large as it needs to be to support the UDF, and the same holds true for its indexes.
You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
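As a minimal sketch of this layout (all table and column names below are hypothetical), each sparsely populated UDF gets its own narrow table keyed back to the master table, and aggregations touch only the rows that actually have a value:

```sql
-- Master table (illustrative names throughout).
CREATE TABLE Item (
    ItemId  INT PRIMARY KEY,
    Name    VARCHAR(100) NOT NULL
);

-- One narrow table per sparsely populated UDF, keyed back to the master.
CREATE TABLE ItemUdfWeight (
    ItemId  INT PRIMARY KEY REFERENCES Item (ItemId),
    Weight  DECIMAL(10, 2) NOT NULL
);

-- Aggregations scan only the rows that actually have the UDF...
SELECT AVG(Weight) AS AvgWeight
FROM ItemUdfWeight;

-- ...and results can be joined back to the master table by foreign key
-- to pick up the non-aggregated attributes.
SELECT i.ItemId, i.Name, w.Weight
FROM Item i
JOIN ItemUdfWeight w ON w.ItemId = i.ItemId;
```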
You can use table/column names that reflect what the data actually is.
You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application-level constraint enforcement.
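As a sketch of how those declarative domains might look on a per-UDF table (the names and allowed values here are made up for illustration):

```sql
-- The UDF's domain is enforced declaratively: a real data type,
-- a default value, a CHECK constraint, and a foreign key to the
-- master table. No on-the-fly type conversion is needed.
CREATE TABLE ItemUdfColor (
    ItemId  INT PRIMARY KEY REFERENCES Item (ItemId),
    Color   VARCHAR(20) NOT NULL DEFAULT 'unspecified',
    CHECK (Color IN ('red', 'green', 'blue', 'unspecified'))
);
```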
This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
There is more application code needed to handle UDF definition and management. I expect this is still less code than would be needed for the original options 1, 3, and 4.
If there is anything about the nature of the data that makes it sensible to group certain UDFs, that should be encouraged: those data elements can then be combined into a single table. For example, say you have UDFs for color, size, and cost. The tendency in the data is that most instances look like
'red', 'large', 45.03
rather than
NULL, 'medium', NULL
In such a case, you won't incur a noticeable speed penalty by combining the three columns in one table, because few values would be NULL, and you avoid creating two more tables, which means two fewer joins when you need to access all three columns.
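A sketch of that grouped table, again with hypothetical names:

```sql
-- Color, size, and cost tend to be populated together, so one table
-- covers all three with a single join back to the master table.
CREATE TABLE ItemUdfAttributes (
    ItemId  INT PRIMARY KEY REFERENCES Item (ItemId),
    Color   VARCHAR(20),
    Size    VARCHAR(10),
    Cost    DECIMAL(10, 2)
);
```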
If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.
Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options are provided by your RDBMS of choice.
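For example, PostgreSQL offers declarative range partitioning; the sketch below assumes a hypothetical date-stamped UDF table, and the exact syntax varies by RDBMS:

```sql
-- PostgreSQL-style declarative range partitioning (syntax varies by RDBMS).
CREATE TABLE ItemUdfReading (
    ItemId   INT NOT NULL,
    ReadAt   DATE NOT NULL,
    Reading  DECIMAL(10, 2)
) PARTITION BY RANGE (ReadAt);

-- Each partition holds one year's rows; queries filtered on ReadAt
-- only touch the relevant partitions.
CREATE TABLE ItemUdfReading_2024 PARTITION OF ItemUdfReading
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```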
I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.
What I use for this when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field, just a table per searchable field). This is sometimes called an inverted index design.
You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
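A minimal sketch of that design, assuming a text blob for the serialized fields and one hypothetical index table for a searchable "color" field:

```sql
-- Entities carry all custom fields in one serialized blob (XML here,
-- but the same idea works with JSON).
CREATE TABLE Entity (
    EntityId  INT PRIMARY KEY,
    Body      TEXT NOT NULL  -- serialized custom fields
);

-- One inverted-index table per field you need to search or sort on.
CREATE TABLE EntityIndexColor (
    Color     VARCHAR(20) NOT NULL,
    EntityId  INT NOT NULL REFERENCES Entity (EntityId),
    PRIMARY KEY (Color, EntityId)
);

-- Search by the indexed field, then fetch the full blobs.
SELECT e.EntityId, e.Body
FROM EntityIndexColor ix
JOIN Entity e ON e.EntityId = ix.EntityId
WHERE ix.Color = 'red';
```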
Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.
This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.
They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.
I agree with Bill Karwin: the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but it only works well when the name-value pair makes a complete tuple of information. When using it forces you to dynamically reconstruct a table at run time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance, or it forces you to push the tuple reconstruction up into the object layer.
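To illustrate the pivot maintenance, this is the sort of query an EAV layout forces on you just to reconstruct one logical row (table and field names are hypothetical), and every new user-defined field means editing it:

```sql
-- Reconstructing a single logical row from EAV name-value pairs.
SELECT
    EntityId,
    MAX(CASE WHEN FieldName = 'color' THEN FieldValue END) AS Color,
    MAX(CASE WHEN FieldName = 'size'  THEN FieldValue END) AS Size,
    MAX(CASE WHEN FieldName = 'cost'  THEN FieldValue END) AS Cost
FROM EntityAttributeValue
GROUP BY EntityId;
```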
You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.
You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off, and one that ends with you having to place artificial limits on the dynamic nature of your set: something like "you can only have x user-defined fields, and each can only be y characters long."
With a document-oriented solution like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue there, life is happy; neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow), at lengths that don't get hard to manage until you reach about 4 MB.
If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.
I would most probably create a table of the following structure:
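(A sketch of what that structure might look like; the table and column names here are illustrative, chosen to match the value types discussed next.)

```sql
-- One row per owning record and user-defined field.
CREATE TABLE UserFields (
    MasterId     INT NOT NULL,         -- link to the owning master record
    FieldName    VARCHAR(50) NOT NULL,
    StringValue  VARCHAR(400) NULL,
    NumberValue  DECIMAL(18, 4) NULL,  -- can double for ints and booleans
    DateValue    TIMESTAMP NULL,
    PRIMARY KEY (MasterId, FieldName)
);
```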
The exact types of course depend on your needs (and on the DBMS you are using). You could also use the NumberValue (decimal) field for ints and booleans. You may need other types as well.
You need some link to the master records that own the values. It's probably easiest and fastest to create a user-fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.
You may want to have some kind of meta data information. So you end up with two tables, UdfMetaData and MasterUdfValues, sketched below.
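(Again a sketch under assumed column names; `Master` stands in for whichever master table owns the values.)

```sql
-- Describes each user-defined field once.
CREATE TABLE UdfMetaData (
    MetaDataId  INT PRIMARY KEY,
    FieldName   VARCHAR(50) NOT NULL,
    FieldType   VARCHAR(20) NOT NULL   -- e.g. 'string', 'number', 'date'
);

-- Holds the values, linked to both the master record and the field definition.
CREATE TABLE MasterUdfValues (
    MasterId     INT NOT NULL REFERENCES Master (MasterId),
    MetaDataId   INT NOT NULL REFERENCES UdfMetaData (MetaDataId),
    StringValue  VARCHAR(400) NULL,
    NumberValue  DECIMAL(18, 4) NULL,
    DateValue    TIMESTAMP NULL,
    PRIMARY KEY (MasterId, MetaDataId)
);
```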
Whatever you do, I would not change the table structure dynamically; it is a maintenance nightmare. I would also not use XML structures; they are much too slow.