Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexs (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then perform index building?

like image 682
John Avatar asked Oct 28 '10 12:10

John


3 Answers

I'd temper af's answer by saying that it would probably be the case that "index first, insert after" would be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting records in the natural order of that index. The reason being that for each insert, the data rows themselves would be have to be ordered on disk.

As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.

I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?

like image 179
Neil Moss Avatar answered Oct 21 '22 16:10

Neil Moss


Inserting data while indices are in place causes DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create indices afterwards. Especially if there is that much data.

(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)

like image 45
af. Avatar answered Oct 21 '22 17:10

af.


It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.

The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.

like image 3
Winston Smith Avatar answered Oct 21 '22 17:10

Winston Smith