
C# Optimisation: Inserting 200 million rows into database

I have the following (simplified) code which I'd like to optimise for speed:

long inputLen = 50000000; // 50 million 
DataTable dataTable = new DataTable(); // four columns assumed to be defined here
DataRow dataRow;
object[] objectRow = new object[4];    // reused for every row
while (inputLen-- > 0)
{
    objectRow[0] = ...
    objectRow[1] = ...
    objectRow[2] = ...

    // Generate output for this input
    output = ...

    for (int i = 0; i < outputLen; i++) // outputLen can range from 1 to 20,000
    {
         objectRow[3] = output[i];
         dataRow = dataTable.NewRow();
         dataRow.ItemArray = objectRow;
         dataTable.Rows.Add(dataRow);
    }
}

// Bulk copy
SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
bulkTask.DestinationTableName = "newTable";
bulkTask.BatchSize = dataTable.Rows.Count;
bulkTask.WriteToServer(dataTable);
bulkTask.Close();

I'm already using SqlBulkCopy in an attempt to speed things up, but it appears that populating the DataTable itself is the slow part.

I don't know how DataTables work internally, so I'm wondering whether I'm creating unnecessary overhead by first filling a reusable array, then assigning it to a DataRow, then adding the DataRow to the DataTable. Or is using a DataTable not optimal in the first place? The input comes from a database.

I don't care much about LOC, just speed. Can anyone give some advice on this?

asked Feb 01 '11 by David Tang

2 Answers

For such a big table, you should instead use the

public void WriteToServer(IDataReader reader)

method.

It may mean you'll have to implement a "fake" IDataReader yourself (if you don't already get the data from an existing IDataReader), but this way you get "streaming" from end to end and avoid building up 200 million rows in memory.
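
For illustration, a minimal sketch of such a "fake" reader over an IEnumerable<object[]> might look like the following. The class name EnumerableDataReader, the enumerable source, and the column-name handling are assumptions for this sketch, not part of the original answer; in practice SqlBulkCopy only calls Read, FieldCount, GetValue, and GetOrdinal/GetName when column mappings are used, so the remaining members can be stubs.

using System;
using System.Collections.Generic;
using System.Data;

// Sketch of a streaming IDataReader over an in-memory/lazy sequence of rows.
public sealed class EnumerableDataReader : IDataReader
{
    private readonly IEnumerator<object[]> rows;
    private readonly string[] columns;
    private bool closed;

    public EnumerableDataReader(IEnumerable<object[]> source, params string[] columnNames)
    {
        rows = source.GetEnumerator();
        columns = columnNames;
    }

    // Members SqlBulkCopy relies on.
    public int FieldCount => columns.Length;
    public bool Read() => rows.MoveNext();
    public object GetValue(int i) => rows.Current[i];
    public string GetName(int i) => columns[i];
    public int GetOrdinal(string name) => Array.IndexOf(columns, name);
    public bool IsDBNull(int i) => rows.Current[i] == null || rows.Current[i] == DBNull.Value;

    // Housekeeping.
    public void Close() { closed = true; rows.Dispose(); }
    public void Dispose() => Close();
    public bool IsClosed => closed;
    public int Depth => 0;
    public int RecordsAffected => -1;
    public bool NextResult() => false;
    public DataTable GetSchemaTable() => null; // not needed for a plain bulk copy

    // Remaining IDataRecord members, stubbed for completeness.
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public char GetChar(int i) => (char)GetValue(i);
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
    public Type GetFieldType(int i) => typeof(object);
    public string GetDataTypeName(int i) => "object";
    public int GetValues(object[] values) { rows.Current.CopyTo(values, 0); return rows.Current.Length; }
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
}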

answered by Simon Mourier


Instead of holding a huge data table in memory, I would suggest implementing an IDataReader which serves up the data as the bulk copy goes. This reduces the need to keep everything in memory up front, and should thus improve performance.
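
As a concrete illustration of this approach, the question's loop could be turned into an iterator that yields one object[] per row and is streamed through the EnumerableDataReader sketched above. GenerateOutput, the placeholder column values, the column names, and the connection handling below are assumptions standing in for the asker's real logic, not part of the original answer.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static class StreamingBulkLoad
{
    // Placeholder standing in for the question's real per-input computation.
    static int[] GenerateOutput() => new[] { 1, 2, 3 };

    // Lazily yields one object[] per destination row; nothing is buffered up front.
    static IEnumerable<object[]> ProduceRows()
    {
        long inputLen = 50000000; // 50 million
        while (inputLen-- > 0)
        {
            object col0 = null, col1 = null, col2 = null; // placeholders for the real input values
            int[] output = GenerateOutput();
            for (int i = 0; i < output.Length; i++)
                yield return new object[] { col0, col1, col2, output[i] };
        }
    }

    static void Load(SqlConnection connection)
    {
        using (var reader = new EnumerableDataReader(ProduceRows(), "Col0", "Col1", "Col2", "Col3"))
        using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
        {
            bulk.DestinationTableName = "newTable";
            bulk.BulkCopyTimeout = 0;   // a load this size will outlive the default 30-second timeout
            bulk.WriteToServer(reader); // rows are produced and sent as the copy progresses
        }
    }
}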

answered by Peter Lillevold