The tool we currently use is Informatica, and we have bookend stored procedures that drop the clustered indexes before the load and then add them back afterwards. In the procedure that re-adds the clustered indexes, the index DDL is hard-coded (we don't generate it from the sys catalog views, out of fear that Microsoft could change those views and a regenerated index would be wrong or fail to build). This causes issues when someone creates a clustered index but doesn't think to update the stored procedure; the next time a bulk load occurs, that index is gone. We previously did this for all indexes, but switched the non-clustered indexes to a disable/rebuild pattern. That is not an option for the clustered index, because we would no longer be able to insert into the table if it were disabled: the clustered index essentially is the table.
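For reference, the disable/rebuild pattern we use for the non-clustered indexes looks roughly like this (table and index names here are made up for illustration):

    -- Disable a non-clustered index before the bulk load; its definition
    -- stays in the metadata, so no hard-coded DDL is needed to restore it.
    ALTER INDEX IX_Sales_CustomerID ON dbo.Sales DISABLE;

    -- ...bulk load runs here...

    -- Rebuild it afterwards from the preserved definition.
    ALTER INDEX IX_Sales_CustomerID ON dbo.Sales REBUILD;

    -- The same pattern applied to the clustered index, e.g.
    --   ALTER INDEX PK_Sales ON dbo.Sales DISABLE;
    -- would leave dbo.Sales unreadable and unwritable until rebuilt,
    -- because the clustered index is the table itself.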
Performance is important but not everything: good performance with easy maintainability trumps great performance with complex maintainability.
After reading many sites, it seems almost universally agreed that when bulk inserting data that is not ordered the same as your primary key, inserting into a heap and applying the PK afterwards is faster ( http://msdn.microsoft.com/en-us/library/ms177445.aspx , http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx ). Most of these sites, however, make assumptions that do not hold at my organization or with my toolset.
Due to our standards policies we have to use the FULL recovery model, so minimal logging will not occur no matter which option I choose regarding heap vs. clustered index.
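For anyone verifying this constraint on their own system, the recovery model can be checked with a simple catalog query ('YourDatabase' is a placeholder name):

    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = N'YourDatabase';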
According to our Informatica admins, specifying TABLOCK or ORDER hints for bcp is not possible through the UI, and our organization is averse to customization beyond the UI for maintainability reasons.
So, after all of this, the question is: given the factors above, would you recommend that we continue with our somewhat unreliable stored procedures, insert directly into the clustered index, or adopt some third, far superior solution? I also realize there are other Stack questions similar to this one, but they do not address bulk loading specifically and/or make assumptions in their answers that do not apply here.
Bulk loading fills rowgroups to maximum capacity and compresses them directly into the columnstore. Only rows at the end of a load that don't meet the 102,400-rows-per-rowgroup minimum go to the deltastore. To perform a bulk load, you can use the bcp utility, Integration Services, or an insert that selects rows from a staging table.
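As a rough sketch of the staging-table route into a clustered columnstore (table names here are hypothetical), a batch of at least 102,400 rows compresses directly instead of landing in the deltastore:

    -- TABLOCK lets the insert run as a bulk (and potentially parallel) load.
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT SaleID, SaleDate, Amount
    FROM dbo.Stage_FactSales;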
Minimal logging records only the limited information required to roll back a transaction. Minimal logging doesn't support point-in-time recovery (PITR): you cannot restore a transaction-log backup to a specific point in time if that backup contains any bulk-logged transactions.
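Where policy permits it (it does not in this case), the usual way to contain that PITR gap is to bracket the load with log backups and a temporary switch to BULK_LOGGED; a sketch with placeholder names:

    -- A log backup before the switch preserves the PITR chain up to here.
    BACKUP LOG YourDatabase TO DISK = N'X:\backups\pre_load.trn';
    ALTER DATABASE YourDatabase SET RECOVERY BULK_LOGGED;

    -- ...bulk load runs here, minimally logged where eligible...

    ALTER DATABASE YourDatabase SET RECOVERY FULL;
    -- Only the backup spanning the load window loses point-in-time granularity.
    BACKUP LOG YourDatabase TO DISK = N'X:\backups\post_load.trn';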
In SQL Server, a non-clustered index can include up to 1023 non-key columns, but it must have at least one key column in order to be created.
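For example (names are hypothetical), the key column is mandatory and the rest can ride along as non-key columns:

    -- One key column (OrderDate) is required; CustomerID and TotalDue
    -- are stored as included, non-key columns.
    CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    INCLUDE (CustomerID, TotalDue);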
My suggestion would be to bulk load into a staging table (a heap, or a clustered index matching the file order), (re-)build the clustered index there to match the destination table, and then insert straight from the staging table. To reduce blocking, lock escalation, log use, etc., you could do this in batches of 10,000 rows at a time, committing and/or checkpointing every so often (see the sketch below).
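A minimal sketch of that approach, assuming a staging table dbo.Stage_Sales and a destination dbo.Sales clustered on SaleID (all names hypothetical):

    -- Build a clustered index on staging that matches the destination,
    -- so the batched inserts below read in destination key order.
    CREATE CLUSTERED INDEX CIX_Stage_Sales ON dbo.Stage_Sales (SaleID);

    DECLARE @LastID int = 0,
            @NextID int;

    WHILE 1 = 1
    BEGIN
        -- Upper key of the next 10,000-row slice, in clustered-key order.
        SELECT @NextID = MAX(SaleID)
        FROM (SELECT TOP (10000) SaleID
              FROM dbo.Stage_Sales
              WHERE SaleID > @LastID
              ORDER BY SaleID) AS batch;

        IF @NextID IS NULL BREAK;   -- staging table exhausted

        INSERT INTO dbo.Sales (SaleID, SaleDate, Amount)
        SELECT SaleID, SaleDate, Amount
        FROM dbo.Stage_Sales
        WHERE SaleID > @LastID
          AND SaleID <= @NextID;

        SET @LastID = @NextID;

        CHECKPOINT;  -- under FULL recovery, periodic log backups help more
    END;

Keying the batches on the clustered key keeps each insert ordered and each transaction small, which limits lock escalation and log growth.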
You might also consider using a pre-processor (C#, perhaps) that takes the load file and writes a new one in the proper sort order.
Also, I think you are safer using sys.indexes etc. than hard-coding the index structures in the code. Microsoft is far less likely to change a column name in sys.indexes than someone at your shop (no offense intended) is to change an index but forget to update the hard-coded definition in the procedure.
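A starting point for generating the definitions from the catalog rather than hand-maintaining them (the table name is a placeholder, and this simplified query ignores included columns, filters, compression, and partitioning):

    SELECT i.name AS index_name,
           STUFF((SELECT ', ' + c.name +
                         CASE WHEN ic.is_descending_key = 1
                              THEN ' DESC' ELSE '' END
                  FROM sys.index_columns AS ic
                  JOIN sys.columns AS c
                    ON c.object_id = ic.object_id
                   AND c.column_id = ic.column_id
                  WHERE ic.object_id = i.object_id
                    AND ic.index_id  = i.index_id
                    AND ic.is_included_column = 0
                  ORDER BY ic.key_ordinal
                  FOR XML PATH('')), 1, 2, '') AS key_columns
    FROM sys.indexes AS i
    WHERE i.object_id = OBJECT_ID(N'dbo.Sales')
      AND i.type = 1;   -- 1 = clustered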