After reading the tips from this great Nettuts+ article I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads and at the same time lower the number of tables needed in the whole database schema, however I'm not sure if this is a good idea since it doesn't follow the rules of normalization and I would like to hear your advice, here is the general idea:
I've four types of users modeled in a Class Table Inheritance structure, in the main "user" table I store data common to all the users (id, username, password, several flags, ...) along with some TIMESTAMP fields (date_created, date_updated, date_activated, date_lastLogin, ...).
To quote the tip #16 from the Nettuts+ article mentioned above:
Example 2: You have a “last_login” field in your table. It updates every time a user logs in to the website. But every update on a table causes the query cache for that table to be flushed. You can put that field into another table to keep updates to your users table to a minimum.
Now it gets even trickier, I need to keep track of some user statistics like
In my fully normalized database this adds up to about 8 to 10 additional tables, it's not a lot but I would like to keep things simple if I could, so I've come up with the following "events" table:
|------|----------------|----------------|---------------------|-----------| | ID | TABLE | EVENT | DATE | IP | |------|----------------|----------------|---------------------|-----------| | 1 | user | login | 2010-04-19 00:30:00 | 127.0.0.1 | |------|----------------|----------------|---------------------|-----------| | 1 | user | login | 2010-04-19 02:30:00 | 127.0.0.1 | |------|----------------|----------------|---------------------|-----------| | 2 | user | created | 2010-04-19 00:31:00 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 2 | user | activated | 2010-04-19 02:34:00 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 2 | user | approved | 2010-04-19 09:30:00 | 217.0.0.1 | |------|----------------|----------------|---------------------|-----------| | 2 | user | login | 2010-04-19 12:00:00 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | created | 2010-04-19 12:30:00 | 127.0.0.1 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | impressed | 2010-04-19 12:31:00 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | clicked | 2010-04-19 12:31:01 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | clicked | 2010-04-19 12:31:02 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | clicked | 2010-04-19 12:31:03 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | clicked | 2010-04-19 12:31:04 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 15 | user_ads | clicked | 2010-04-19 12:31:05 | 127.0.0.2 | |------|----------------|----------------|---------------------|-----------| | 2 | user | blocked | 2010-04-20 03:19:00 | 217.0.0.1 | |------|----------------|----------------|---------------------|-----------| | 2 | user | deleted | 2010-04-20 03:20:00 | 217.0.0.1 | |------|----------------|----------------|---------------------|-----------| Basically the ID refers to the primary key (id) field in the TABLE table, I believe the rest should be pretty straightforward. One thing that I've come to like in this design is that I can keep track of all the user logins instead of just the last one, and thus generate some interesting metrics with that data.
Due to the growing nature of the events table I also thought of making some optimizations, such as:
TABLE and EVENTS columns could be setup as ENUMs instead of VARCHARs to save some space.IPs as UNSIGNED INTs with INET_ATON() instead of VARCHARs.DATEs as TIMESTAMPs instead of DATETIMEs.ARCHIVE (CSV?InnoDB / MyISAM. INSERTs and SELECTs are supported, and data is compressed on the fly.Overall, each event would only consume 14 (uncompressed) bytes which is okay for my traffic I guess.
SELECT * FROM events WHERE id = 2 AND table = 'user' ORDER BY date DESC();ID, TABLE and EVENT).I'm more inclined to go with this approach since the pros seem to far outweigh the cons, but I'm still a little bit reluctant... Am I missing something? What are your thoughts on this?
Thanks!
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Just to be clear, you mean I should add an additional table that maps which events are allowed in a table and use the PK of that table in the events table instead of having a TABLE / EVENT pair?
These are all statistics derived from existing data, aren't they?
The additional tables are mostly related to statistics but I the data doesn't already exists, some examples:
user_ad_stats user_post_stats ------------- --------------- user_ad_id (FK) user_post_id (FK) ip ip date date type (impressed, clicked) If I drop these tables I've no way to keep track of who, what or when, not sure how views can help here.
I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things. I don't think volatility is so important.
I've heard it both ways and I couldn't find anything in the MySQL manual that states that either one is right. Anyway, I agree with you that they should be separated tables because they represent kinds of data (with the added benefit of being more descriptive than a regular approach).
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues.
What I meant for "not as bad as EAV" is that all records follow a linear structure and they are pretty easy to query, there is no hierarchical structure so all queries can be done with a simple SELECT.
Regarding your second statement, I think you understood me wrong here; the IP address is not necessarily associated with the user. The table structure should read something like this:
IP address (
IP) did something (EVENT) to the PK (ID) of the table (TABLE) on date (DATE).
For instance, in the last row of my example above it should read that IP 217.0.0.1 (some admin), deleted the user #2 (whose last known IP is 127.0.0.2) at 2010-04-20 03:20:00.
You can still join, say, user events to users, but you can't implement a foreign key constraint.
Indeed, that's my main concern. However I'm not totally sure what can go wrong with this design that couldn't go wrong with a traditional relational design. I can spot some caveats but as long as the app messing with the database knows what it is doing I guess there shouldn't be any problems.
One other thing that counts in this argument is that I will be storing much more events, and each event will more than double compared to the original design, it makes perfect sense to use the ARCHIVE storage engine here, the only thing is it doesn't support FKs (neither UPDATEs or DELETEs).
The event table is a database table created by the user, generally within the same schema as the application table for which it stores events. The event table describes the type of change made to an application table, and also contains an identifier for the changed row.
Within a database, related data are grouped into tables, each of which consists of rows (also called tuples) and columns, like a spreadsheet. To convert your lists of data into tables, start by creating a table for each type of entity, such as products, sales, customers, and orders.
I highly recommend this approach. Since you're presumably using the same database for OLTP and OLAP, you can gain significant performance benefits by adding in some stars and snowflakes.
I have a social networking app that is currently at 65 tables. I maintain a single table to track object (blog/post, forum/thread, gallery/album/image, etc) views, another for object recommends, and a third table to summarize insert/update activity in a dozen other tables.
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Clarifying for Alix - Yes, you maintain a reference table for objects, and a reference table for events (these would be your dimension tables). Your fact table would have the following fields:
id object_id event_id event_time ip_address
It looks like a pretty reasonable design, so I just wanted to challenge a few of your assumptions to make sure you had concrete reasons for what you're doing.
In my fully normalized database this adds up to about 8 to 10 additional tables
These are all statistics derived from existing data, aren't they? (Update: okay, they're not, so disregard following.) Why wouldn't these simply be views, or even materialized views?
It may seem like a slow operation to gather those statistics, however:
I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads
I guess you're talking about how the user (just to pick one table) events, which would be pretty volatile, are separated from the user data. I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things.
I don't think volatility is so important. The DBMS should already allow you to put the log file and database file on separate devices, which accomplishes the same thing, and contention shouldn't be an issue with row-level locking.
Non-relational (still not as bad as EAV)
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues. (Update: Okay, so it's sort of kinda like that.)
You can still join, say, user events to users, but you can't implement a foreign key constraint. That's why EAV is generally problematic; whether or not something is exactly EAV doesn't really matter. It's generally one or two lines of code to implement a constraint in your schema, but in your app it could be dozens of lines of code, and if the same data is accessed in multiple places by multiple apps, it can easily multiply to thousands of lines of code. So, generally, if you can prevent bad data with a foreign key constraint, you're guaranteed that no app will do that.
You might think that events aren't so important, but, as an example, ad impressions are money. I would definitely want to catch any bugs relating to ad impressions as early in the design process as possible.
Further comment
I can spot some caveats but as long as the app messing with the database knows what it is doing I guess there shouldn't be any problems.
And with some caveats you can make a very successful system. With a proper system of constraints, you get to say, "if any app messing with the database doesn't know what it's doing, the DBMS will flag an error." That may require a more time and money than you've got, so something simpler that you can have is probably better than something more perfect that you can't. C'est la vie.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With