Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustered indexes on non-identity columns to speed up bulk inserts?

My two questions are:

  • Can I use clustered indexes to speed up bulk inserts in big tables?
  • Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

To elaborate, I have a database with a couple of very big (between 100-1000 mln rows) tables containing company data. Typically there is data about 20-40 companies in such a table, each as their own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with their own "subchunk" marked by "DepartmentIdentifier" (INT).

It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.

I have been working to optimize these tables for the following functions:

  1. Queries that are run on subchunks
  2. "Benchmarking" queries that are run on the table as a whole
  3. Inserting/removing big chunks of data.

For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.

But for 3) I have struggled to find a good solution. My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.

At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.

I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. Before I had abandoned this strategy in favour of adding a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question, would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts or is that just my imagination?

I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?

Thanks a lot for any suggestions or explanations.

like image 967
littlegreen Avatar asked Sep 17 '10 08:09

littlegreen


2 Answers

Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.

Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index. Deleting a chunk is faster than with or without clustered index.

I think the fact that all the records I want to delete or insert are guaranteed to be all together on a certain part of the harddisk makes the tables faster - it would seem logical to me.


Update: After a year of experience with this design I can say that for this approach to work, it is necessary to schedule regular rebuilding of all the indexes (we do it once a week). Otherwise, the indexes become fragmented very soon and performance is lost. Nevertheless, we are in a process of migration to a new database design with partitioned tables, which is basically better in every way - except for the Enterprise Server license cost, but we've already forgotten about it by now. At least I have.

like image 67
littlegreen Avatar answered Sep 27 '22 18:09

littlegreen


A clustered index is a physical index, a physical data structure, a row order. If you insert in the middle of the clustered index, the data will be physically inserted in the middle of the present data. I imagine a serious performance issue in this case. I only know this from theory, because if I do this in practice, it will be a mistake according to my theoretical knowledge.

Therefore, I only use (and advise the use) of clustered indexes on fields that are always, physically, inserted at the end, preserving the order.

A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically they will be ordered after appending a row. Identity is a good clustered index also, but not always relevant for querying.

In your solution you place a [uniquifier] field, but why do this when you can put an identity that will do just that? It will be unique, physically ordered, small (for foreign keys in other tables means smaller index), and in some cases faster.

Can't you try this, experiment? I have a similar situation here, where I have 4 billion rows, constantly more are inserting (up to 100 per second), the table has no primary key and no clustered index, so the propositions in this topic are VERY interesting for me too.

like image 37
Alex Avatar answered Sep 27 '22 18:09

Alex