Table design options for large number of rows?

I have an application that sends data based on user interaction (not user input). The data sent could be an Integer, String, Date, or Boolean value. There are 140 keys. We may get anywhere from 1 key value pair to all 140 at a time.

We want to store everything but will only be using 20 out of 140 keys within the application. The remaining will be used for an audit trail later on - so we still need to store them.

This data is used by the application to decide where the user needs to go, so it needs to access the record by student ID and pull the 20 or so options within milliseconds. There could be billions of rows of data (this is an upgrade to an existing application with over 20,000 users), so performance is critical. The user generates a new row each time they access the application.

EXAMPLE DATA:

Score:1
ID:3212
IsLast:False
Action:Completed

I have two ideas on how to do this and am looking for some help on which is best, or whether a third option would be a better choice.

OPTION 1:

My first idea is to store every value as a string in a single column, with a look-up table of possible data types to consult when the value needs to be cast for use.

value       | dataType
-----------------------
"1"         | int
"Completed" | string

While the data being sent is not user generated, I know there must be a gotcha somewhere in this method. The only reason for doing this is that we don't know which key:value pairs will be sent (outside of date and id), and we are trying to avoid having more than a few columns.

The SO Question How to Handle Unknown Data Type in one Table uses a similar idea.
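A minimal sketch of this key/value layout, with hypothetical table and column names:

-- Hypothetical key/value layout: one row per key actually sent.
CREATE TABLE InteractionValue (
    StudentId INT          NOT NULL,
    KeyName   VARCHAR(50)  NOT NULL,
    Value     VARCHAR(255) NOT NULL,  -- everything stored as a string
    DataType  VARCHAR(10)  NOT NULL   -- 'int', 'string', 'date', 'bool'
);

-- Reading a value back means casting at query time:
SELECT CAST(Value AS INT) AS Score
FROM   InteractionValue
WHERE  StudentId = 3212 AND KeyName = 'Score';

The gotcha is exactly that CAST: every read pays a string conversion, and a bad value fails at query time rather than at insert time.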

OPTION 2:

The other solution is to have 140 columns, one for each key. However, the amount of data generated is very large (billions of rows), so I suspect querying this table would not be fast enough.
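A rough sketch of the wide table (names hypothetical). Since most of the 140 values will be NULL on any given row, SQL Server 2008's SPARSE columns could reduce storage:

-- Hypothetical wide layout: one nullable column per key.
CREATE TABLE InteractionWide (
    InteractionId BIGINT IDENTITY(1,1) PRIMARY KEY,
    StudentId     INT NOT NULL,
    Score         INT         SPARSE NULL,
    IsLast        BIT         SPARSE NULL,
    [Action]      VARCHAR(50) SPARSE NULL
    -- ... and so on, one column per remaining key, up to 140
);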

Technical details: this uses SQL Server 2008 (not R2), with .NET C# and Reporting Services.

Am I missing something here - what is the best way to create this table for performance?

asked Feb 24 '10 by Todd Moses

2 Answers

Vertically segment your data. Put the 20 keys that are necessary for navigational control in one table, all 20 in one row, with a primary key that identifies the user interaction (call it, say, InteractionId). Put the other 120 values in another table with a composite primary key based on the PK of the first table (InteractionId, plus a KeyTypeId identifying which of the 120 possible key-value pairs the value is for). Store all the values in this second table as strings. In a third lookup table called, say, KeyTypes, store the KeyTypeId, KeyTypeName, and KeyValueDataType, so your code knows how to cast the string value and output it properly as a string, datetime, integer, decimal, or whatever.

The first table will be accessed much more often, so it contains only those values the application's navigational functionality needs frequent access to. This keeps the rows narrower, which allows more rows per page and minimizes disk IO. Putting all 20 values in one row also keeps the row count smaller (~1/20th as large), minimizing the depth of the index seeks that need to be performed for each access.

The other table, with the remaining 120 key-values, will not be accessed as frequently, so its structure can probably be optimized for logical simplicity rather than for performance.
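A sketch of the three tables described above, with illustrative column names and types:

-- "Hot" table: one row per interaction, the 20 navigational keys inline.
CREATE TABLE Interaction (
    InteractionId BIGINT IDENTITY(1,1) PRIMARY KEY,
    StudentId     INT NOT NULL,
    Score         INT NULL,
    IsLast        BIT NULL
    -- ... the rest of the 20 navigational columns
);

-- Lookup: one row describing each of the other 120 keys.
CREATE TABLE KeyTypes (
    KeyTypeId        INT PRIMARY KEY,
    KeyTypeName      VARCHAR(50) NOT NULL,
    KeyValueDataType VARCHAR(20) NOT NULL  -- 'int', 'string', 'datetime', ...
);

-- "Cold" table: one row per audit key actually sent, value held as a string.
CREATE TABLE InteractionDetail (
    InteractionId BIGINT NOT NULL REFERENCES Interaction (InteractionId),
    KeyTypeId     INT    NOT NULL REFERENCES KeyTypes (KeyTypeId),
    Value         VARCHAR(255) NOT NULL,
    PRIMARY KEY (InteractionId, KeyTypeId)
);

An index on Interaction (StudentId) would support the by-student lookup the question describes.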

answered Sep 24 '22 by Charles Bretana


Actually, you might merge the suggestions offered so far:

Create a table with the 20 keys necessary for navigational control, plus one column for a primary key, plus one column of the XML data type to store the rest of the possible data. You could then define an XML schema that handles the data types for each key, plus constraints on certain keys as needed.
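A sketch of that hybrid layout (names are hypothetical); in SQL Server 2008 the xml column can be typed against an XML SCHEMA COLLECTION so the server enforces per-key data types:

-- Hypothetical hybrid layout: hot keys inline, the rest in one XML column.
CREATE TABLE InteractionHybrid (
    InteractionId BIGINT IDENTITY(1,1) PRIMARY KEY,
    StudentId     INT NOT NULL,
    Score         INT NULL,
    IsLast        BIT NULL,
    -- ... the rest of the 20 navigational columns, then:
    AuditData     XML NULL  -- the other ~120 key/value pairs
);

-- Pulling a single audit value back out of the XML:
SELECT AuditData.value('(/keys/Action)[1]', 'VARCHAR(50)') AS [Action]
FROM   InteractionHybrid
WHERE  InteractionId = 1;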

answered Sep 25 '22 by Adrian J. Moreno