 

Dynamic columns in database tables vs EAV

I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.

For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?

I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.

Could I generate the tables dynamically instead? I see that SQLite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(

Alex asked May 08 '15



2 Answers

Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns in new tables with FK indexes back to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies).

  • Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
    • Slower than normalized, but not as slow as EAV.
  • Adding new extended attributes becomes a problem of adding a new table.
    • Harder than EAV, easier/faster than modifying table schema.
  • Deleting attributes becomes a problem of dropping whole tables.
    • Easier/faster than modifying table schema.
  • These new attributes can be strongly typed.
    • As good as modifying table schema, faster than EAV or generic columns.

The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
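A minimal sketch of this layout using Python's built-in sqlite3 module (the table and column names are illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Main table holds the core, stable attributes.
cur.execute("CREATE TABLE car (car_id INTEGER PRIMARY KEY, brand TEXT, model TEXT)")

# Each group of extended attributes lives in its own table,
# keyed back to the main table by a foreign key.
cur.execute("""
    CREATE TABLE car_attributes_doors (
        car_id INTEGER PRIMARY KEY REFERENCES car(car_id),
        doors  INTEGER
    )""")

cur.execute("INSERT INTO car VALUES (1, 'Pontiac', 'Firebird')")
cur.execute("INSERT INTO car_attributes_doors VALUES (1, 2)")

# Selecting means LEFT OUTER JOINing the attribute tables you need.
row = cur.execute("""
    SELECT c.brand, c.model, d.doors
    FROM car c
    LEFT OUTER JOIN car_attributes_doors d ON d.car_id = c.car_id
    WHERE c.car_id = 1
""").fetchone()

# Dropping an attribute group later is a single statement.
cur.execute("DROP TABLE car_attributes_doors")
```

Note that the extended column (`doors`) can be strongly typed as INTEGER, unlike an EAV value column.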

The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That and I'm not sure how much better LEFT OUTER JOIN's actually perform than EAV table joins. It's definitely closer to EAV join performance than normalized table performance.

If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.

JVal90 answered Oct 15 '22


I will assume that SQLite (or another relational DBMS) is a requirement.

EAVs

I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.

Let's say that you design a data model with three tables: entities, attributes, and entity_attributes:

CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);

CREATE TABLE attributes 
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);

CREATE TABLE entity_attributes 
(entity_id INTEGER, attribute_id INTEGER, value TEXT, 
PRIMARY KEY(entity_id, attribute_id));

In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate with your cars (brand, model, color, ...) and their types (text, number, date, ...), and the entity_attributes table will hold the values of the attributes for a given entity (for example "red").

Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).

INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and their relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations).
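The seven statements described above look like this with sqlite3 (sample values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE entities (entity_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT)")
cur.execute("""CREATE TABLE entity_attributes
               (entity_id INTEGER, attribute_id INTEGER, value TEXT,
                PRIMARY KEY (entity_id, attribute_id))""")

# 1 insert for the entity...
cur.execute("INSERT INTO entities VALUES (1, 'my car')")
# ...3 for the attributes...
cur.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                [(1, 'brand', 'text'), (2, 'model', 'text'), (3, 'color', 'text')])
# ...and 3 for the relations: 7 rows in total for one three-attribute entity.
cur.executemany("INSERT INTO entity_attributes VALUES (?, ?, ?)",
                [(1, 1, 'Pontiac'), (1, 2, 'Firebird'), (1, 3, 'red')])

total = cur.execute("""SELECT (SELECT COUNT(*) FROM entities)
                            + (SELECT COUNT(*) FROM attributes)
                            + (SELECT COUNT(*) FROM entity_attributes)""").fetchone()[0]
```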

When you want to perform an UPDATE, you will need to know which entity you want to update, and update the desired attribute by joining through the relation between the entity and its attributes.

When you want to perform a DELETE, you will also need to know which entity you want to delete, delete its attributes, delete the relation between your entity and its attributes, and then delete the entity itself.

But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.

Imagine a data model to store car entities and their properties as in your example (say that we want to store brand and model). With a plain table, a SELECT to query all your records would be

SELECT brand, model FROM cars;

If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a three-table join. The query will perform really badly.

Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?

Indexes are another thing that you cannot benefit from (or at least not as much as would be desirable), and they become very necessary as the stored data grows.

As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
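To make that concrete, here is a sketch of that "red, 1980 Pontiac" filter against the three-table schema above, assuming attribute ids 1, 2 and 3 for brand, year and color: one extra join per filtered attribute, and every comparison happens on a TEXT column (even the year).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE entities (entity_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE entity_attributes
        (entity_id INTEGER, attribute_id INTEGER, value TEXT,
         PRIMARY KEY (entity_id, attribute_id));
    INSERT INTO entities VALUES (1, 'car 1'), (2, 'car 2');
    INSERT INTO attributes VALUES
        (1, 'brand', 'text'), (2, 'year', 'number'), (3, 'color', 'text');
    INSERT INTO entity_attributes VALUES
        (1, 1, 'Pontiac'), (1, 2, '1980'), (1, 3, 'red'),
        (2, 1, 'Pontiac'), (2, 2, '1995'), (2, 3, 'blue');
""")

# One self-join on entity_attributes per attribute you filter on.
rows = cur.execute("""
    SELECT e.entity_id
    FROM entities e
    JOIN entity_attributes brand ON brand.entity_id = e.entity_id
         AND brand.attribute_id = 1 AND brand.value = 'Pontiac'
    JOIN entity_attributes yr    ON yr.entity_id = e.entity_id
         AND yr.attribute_id = 2 AND yr.value = '1980'
    JOIN entity_attributes color ON color.entity_id = e.entity_id
         AND color.attribute_id = 3 AND color.value = 'red'
""").fetchall()
```

Note that the year comparison is a string comparison, because the value column holds everything as TEXT; a range filter like "1975 to 1985" would need casts.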

The only advantage of using EAVs is that you can store virtually everything with the same model, but it is like having a box full of stuff in which you want to find one concrete, small item.

Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models: http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056

Dynamic columns in database tables

On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example create a car table with the basic attributes that you know that you will use and then add columns dynamically when you need them (for example the number of exhausts).

The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.

This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11

BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
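Worth noting: since version 3.35.0 (2021) SQLite supports a native ALTER TABLE ... DROP COLUMN, so on recent versions the copy dance is no longer needed. On older versions, the documented workaround above can be scripted like this (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (a TEXT, b TEXT, c TEXT)")  # we want to drop column c
cur.execute("INSERT INTO t1 VALUES ('1', '2', 'junk')")
conn.commit()

# The pre-3.35 workaround: copy the surviving columns out and back.
cur.executescript("""
    BEGIN;
    CREATE TEMPORARY TABLE t1_backup(a, b);
    INSERT INTO t1_backup SELECT a, b FROM t1;
    DROP TABLE t1;
    CREATE TABLE t1(a, b);
    INSERT INTO t1 SELECT a, b FROM t1_backup;
    DROP TABLE t1_backup;
    COMMIT;
""")

cols = [r[1] for r in cur.execute("PRAGMA table_info(t1)")]
```

As the answer says, this rewrites every row, so it is O(n) in the table size, whereas ADD COLUMN (and the native DROP COLUMN on new versions) only touches the schema.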

Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to make sure that none of your cars still uses this property, to avoid losing data, before deleting the column. But this, of course, depends on your concrete scenario.

Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).

Another option (pseudo-generic model)

A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.

Lets say that you create a table like this:

CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
 name TEXT,
 type TEXT,
 attribute1 TEXT,
 attribute2 TEXT,
 ...
 attributeN TEXT
 );

In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).

If you need to know what attribute37 stands for when type is "car", you would need to add another table that relates the types and attributes with a description of each attribute.

And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).

In this case, the attributes are always stored as TEXT (as in EAVs), with the disadvantages that entails.

But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
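A runnable sketch of the pseudo-generic model, including the attribute-description lookup table mentioned above (table and value names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Pseudo-generic table: a fixed pool of typeless attribute columns.
cur.execute("""
    CREATE TABLE entities (
        entity_id INTEGER PRIMARY KEY,
        name TEXT, type TEXT,
        attribute1 TEXT, attribute2 TEXT, attribute3 TEXT
    )""")

# Lookup table mapping (entity type, column) to what that column means.
cur.execute("""
    CREATE TABLE attribute_meanings (
        type TEXT, column_name TEXT, meaning TEXT,
        PRIMARY KEY (type, column_name)
    )""")
cur.executemany("INSERT INTO attribute_meanings VALUES (?, ?, ?)",
                [('car', 'attribute1', 'brand'),
                 ('car', 'attribute2', 'model'),
                 ('car', 'attribute3', 'color')])

cur.execute("INSERT INTO entities VALUES (1, 'my car', 'car', 'Pontiac', 'Firebird', 'red')")

# Queries stay flat: no joins are needed to read the attributes themselves.
brand, model = cur.execute(
    "SELECT attribute1, attribute2 FROM entities WHERE type = 'car'").fetchone()
```

The application (or a view per entity type) translates column positions into meaningful names via the lookup table.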

Hope it helps.


Follow up from the comments:

With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.

As you say, most of your columns will be empty for most of your records, and you will need to index all of your columns for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
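A partial index on one of the generic columns looks like this (illustrative names); only the rows where the column is actually set end up in the index, so a mostly-empty column costs almost nothing to index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE entities
               (entity_id INTEGER PRIMARY KEY, type TEXT, attribute1 TEXT)""")
cur.executemany("INSERT INTO entities (type, attribute1) VALUES (?, ?)",
                [('car', 'Pontiac'), ('car', None), ('house', None)])

# Partial index: only non-null values are indexed (requires SQLite >= 3.8.0).
cur.execute("""CREATE INDEX idx_attr1 ON entities (attribute1)
               WHERE attribute1 IS NOT NULL""")

# An equality filter implies NOT NULL, so the partial index is usable;
# inspect with EXPLAIN QUERY PLAN to confirm.
plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT entity_id FROM entities WHERE attribute1 = 'Pontiac'""").fetchall()

rows = cur.execute(
    "SELECT entity_id FROM entities WHERE attribute1 = 'Pontiac'").fetchall()
```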

If you implement an EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:

SELECT e.name, a1.value AS brand, a2.value AS model, a3.value AS color
FROM entities e
LEFT JOIN entity_attributes a1 ON e.entity_id = a1.entity_id
  AND a1.attribute_id = (SELECT attribute_id FROM attributes WHERE name = 'brand')
LEFT JOIN entity_attributes a2 ON e.entity_id = a2.entity_id
  AND a2.attribute_id = (SELECT attribute_id FROM attributes WHERE name = 'model')
LEFT JOIN entity_attributes a3 ON e.entity_id = a3.entity_id
  AND a3.attribute_id = (SELECT attribute_id FROM attributes WHERE name = 'color');

As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:

SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;

Also, take into account the potential size of your entity_attributes table. If each entity can have up to 40 attributes, let's say that 20 of them are non-null for each one. If you have 10,000 entities, your entity_attributes table will have 200,000 rows, and you will be querying it through one huge index. With the pseudo-generic model you will have 10,000 rows and one small index per column.

antonio answered Oct 15 '22