
Normalizing an extremely big table

I face the following issue. I have an extremely big table, inherited from the people who previously worked on the project. The table is in MS SQL Server.

The table has the following properties:

  1. it has about 300 columns, all of type text, but some of them actually represent other types (for example, integer or datetime), so the text values have to be converted to the appropriate types before use
  2. the table has more than 100 million rows and will soon occupy about 1 terabyte
  3. the table does not have any indexes
  4. the table is not partitioned.

As you may guess, it is impossible to run any reasonable query against this table. Currently people only insert new records into it; nobody reads from it. So I need to restructure it. I plan to create a new structure and fill it with the data from the old table. Obviously I will implement partitioning, but that is not the only thing to be done.

One of the most important properties of the table is that the fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values, so the actual variety of values in a given column is in the range of 5-30 distinct values. This suggests normalization: for each such textual column I will create an additional table with the list of all the distinct values that may appear in that column, give this additional table a (tinyint) primary key, and use an appropriate foreign key in the original table instead of keeping the text values there. Then I will put an index on this foreign key column. About 100 columns would be processed this way.
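For concreteness, here is a minimal T-SQL sketch of that plan for a single column; all table and column names (dbo.BigTable, Region) are hypothetical stand-ins:

    -- lookup table holding the 5-30 distinct text values of one column
    CREATE TABLE dbo.RegionLookup (
        RegionId   TINYINT      NOT NULL PRIMARY KEY,
        RegionName VARCHAR(100) NOT NULL UNIQUE
    );

    -- the big table stores a 1-byte key instead of the repeated text
    ALTER TABLE dbo.BigTable
        ADD RegionId TINYINT NULL
            CONSTRAINT FK_BigTable_Region
            REFERENCES dbo.RegionLookup (RegionId);

    -- index the foreign key column to support filtering on it
    CREATE NONCLUSTERED INDEX IX_BigTable_RegionId
        ON dbo.BigTable (RegionId);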

It raises the following questions:

  1. would this normalization really increase the speed of queries imposing conditions on some of those 100 fields? If we set aside the space needed to store those columns, would there be any performance gain from substituting tinyint columns for the initial text columns? And if I skip the normalization and simply put an index on the initial text columns, would the performance be the same as for an index on the planned tinyint columns?
  2. if I do the described normalization, then building a view showing the text values will require joining my main table with about 100 additional tables (a sketch of such a view follows this list). A positive point is that all those joins are on "primary key" = "foreign key" pairs, but it is still quite a large number of tables to join. The question is: will the performance of queries against this view be no worse than the performance of queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize queries in a way that exploits the normalization?
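For illustration, a fragment of such a view might look like this (hypothetical names, continuing the sketch above; the real view would join about 100 lookup tables, not two):

    CREATE VIEW dbo.BigTableDenormalized AS
    SELECT bt.Id,
           r.RegionName,
           s.StatusName
           -- ...and so on for the remaining lookup columns
    FROM dbo.BigTable AS bt
        LEFT JOIN dbo.RegionLookup AS r ON r.RegionId = bt.RegionId
        LEFT JOIN dbo.StatusLookup AS s ON s.StatusId = bt.StatusId;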

Sorry for such a long text.

Thanks for any comments!

PS I created a related question regarding joining 100 tables: Joining 100 tables.

asked Feb 07 '13 by iCoffee


4 Answers

You'll find other benefits to normalizing the data besides the speed of queries running against it, such as size and maintainability, which alone should justify the effort.

However, it will also likely improve query speed. Currently, a single row containing 300 text columns is massive, almost certainly past the 8,060-byte limit for in-row data on a page, so much of the data is being stored in ROW_OVERFLOW_DATA or LOB_DATA allocation units instead.
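If you want to verify this on the actual table, the allocation units can be inspected directly. A rough sketch, with dbo.BigTable as a hypothetical name (the container_id = partition_id join is the simplified form, which holds in the common case):

    -- size per allocation unit type for one table (pages are 8 KB)
    SELECT au.type_desc,          -- IN_ROW_DATA, ROW_OVERFLOW_DATA, LOB_DATA
           SUM(au.total_pages) * 8 / 1024 AS size_mb
    FROM sys.partitions AS p
        JOIN sys.allocation_units AS au
            ON au.container_id = p.partition_id
    WHERE p.object_id = OBJECT_ID('dbo.BigTable')
    GROUP BY au.type_desc;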

By reducing the size of each row through normalization (for example, replacing redundant text data with a TINYINT foreign key), and by moving columns that aren't dependent on this large table's primary key into other tables, the data should no longer overflow, and you'll also be able to store more rows per page.

As for the overhead added by performing JOINs to get the normalized data: if you properly index your tables, this shouldn't add a substantial amount of overhead. And if it does turn out to be unacceptable, you can selectively de-normalize the data as necessary.

answered by Michael Fredrickson


Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
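To put numbers on that warning, here is a rough back-of-the-envelope for one such column of a 100-million-row table, ignoring per-row overhead:

    stored inline as char(2):            2 bytes x 100M rows ~ 200 MB
    replaced by an int foreign key:      4 bytes x 100M rows ~ 400 MB  (larger!)
    replaced by a tinyint foreign key:   1 byte  x 100M rows ~ 100 MB  (smaller)

So for very short values a surrogate key only pays off if it is narrower than the value it replaces, which is why the tinyint in the plan matters.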

There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.

Also, SQL Server will spill varchar columns over onto additional pages; it does not spill fixed-length types. You have only 300 columns, but eventually your record data might get close to the 8k limit for data on a single page.

And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. Some groups of columns may belong together: detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing), but be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.

answered by Gordon Linoff


1) The system currently has to do a full table scan over a very significant amount of data, which is what causes the performance problems. Many aspects of optimisation could improve this. Converting the columns to the correct data types would not only significantly improve performance by reducing the size of each record, it would also let the data be made correct. If you query on a column today, you are comparing text against the text in the field. Indexing alone would improve that, but switching to a lookup allows the ID value to be found in a table small enough to keep in memory, and the big table is then scanned on integer values only, which is a much quicker process.

2) If the data is normalised to 3rd normal form or the like, there can be cases where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If that does occur, though, the execution plan can identify it and the query can be amended to reduce the likelihood.
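As a sketch of the lookup pattern described in 1), with hypothetical names:

    -- the text predicate is resolved against the tiny lookup table,
    -- so the big table is filtered on a 1-byte integer, not on text
    SELECT bt.*
    FROM dbo.BigTable AS bt
        JOIN dbo.RegionLookup AS r ON r.RegionId = bt.RegionId
    WHERE r.RegionName = 'EMEA';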

Another point to note: it sounds like a properly structured database might even fit in memory, because the amount of data would be greatly reduced. If so, performance would improve dramatically.

The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and it addresses neither the duplicated data nor the possible data integrity issues.

There are some other changes which can be made: if a lot of the data is not always needed, it can be separated off into a related table and only looked up when required. Fields that are not used for lookups into other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data once you've identified the rows you actually need. This is obviously not a properly normalised structure, but it may be a quick and dirty way to improve performance (after adding indexing).
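A minimal sketch of that split, with hypothetical names: the frequently queried columns stay in a narrow table, and the rarely used wide columns move to a 1:1 side table joined only on demand.

    -- hot columns, queried constantly
    CREATE TABLE dbo.EventCore (
        EventId   BIGINT   NOT NULL PRIMARY KEY,
        EventDate DATETIME NOT NULL,
        StatusId  TINYINT  NOT NULL
    );

    -- rarely needed wide columns, fetched via the shared key when required
    CREATE TABLE dbo.EventExtra (
        EventId    BIGINT NOT NULL PRIMARY KEY
            REFERENCES dbo.EventCore (EventId),
        Comments   VARCHAR(MAX) NULL,
        RawPayload VARCHAR(MAX) NULL
    );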

answered by David Burton


  1. Construct, in your head and on paper, a normalized database structure
  2. Construct the database (with indexes)
  3. De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
  4. Create SQL insert statements to migrate the data into the new database (a sketch follows this list)
  5. Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.
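A hedged sketch of step 4, assuming SQL Server 2012 or later for TRY_CONVERT (all names are hypothetical); copying in key-range batches keeps the transaction log manageable on a 100-million-row table:

    DECLARE @BatchStart BIGINT = 1,
            @BatchEnd   BIGINT = 1000000;

    -- copy one batch, converting the text columns to real types as we go
    INSERT INTO dbo.BigTableNew (Id, EventDate, Amount, RegionId)
    SELECT o.Id,
           TRY_CONVERT(DATETIME, o.EventDateText),  -- NULL if unparseable
           TRY_CONVERT(INT,      o.AmountText),
           r.RegionId                               -- text replaced by lookup key
    FROM dbo.BigTableOld AS o
        LEFT JOIN dbo.RegionLookup AS r ON r.RegionName = o.RegionText
    WHERE o.Id BETWEEN @BatchStart AND @BatchEnd;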
answered by Ed Heal