Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bulk inserting best way to about it? + Helping me understand fully what I found so far

So I saw this post here and read it and it seems like bulk copy might be the way to go.

What’s the best way to bulk database inserts from c#?

I still have some questions and want to know how things actually work.

So I found 2 tutorials.

http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241

http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx

First way uses 2 ado.net 2.0 features. BulkInsert and BulkCopy. the second one uses linq to sql and OpenXML.

This sort of appeals to me as I am using linq to sql already and prefer it over ado.net. However as one person pointed out in the posts what he just going around the issue at the cost of performance( nothing wrong with that in my opinion)

First I will talk about the 2 ways in the first tutorial

I am using VS2010 Express(for testing of the tutorials I used VS2008 and not sure what .net version I just loaded up there sample files and ran them), .net 4.0, MVC 2.0, SQl Server 2005

  1. Is ado.net 2.0 the most current version?
  2. Based on the technology I am using, is there some updates to what I am going to show that would improve it somehow?
  3. Is there any thing that these tutorial left out that I should know about?

BulkInsert

I am using this table for all the examples.

CREATE TABLE [dbo].[TBL_TEST_TEST]
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    [NAME] [varchar](50) 
)

SP Code

USE [Test]
GO
/****** Object:  StoredProcedure [dbo].[sp_BatchInsert]    Script Date: 05/19/2010 15:12:47 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[sp_BatchInsert] (@Name VARCHAR(50) )
AS
BEGIN
            INSERT INTO TBL_TEST_TEST VALUES (@Name);
END 

C# Code

/// <summary>
/// Another ado.net 2.0 way that uses a stored procedure to do a bulk insert.
/// Seems slower then "BatchBulkCopy" way and it crashes when you try to insert 500,000 records in one go.
/// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
/// </summary>
private static void BatchInsert()
{
    // Get the DataTable with Rows State as RowState.Added
    DataTable dtInsertRows = GetDataTable();

    SqlConnection connection = new SqlConnection(connectionString);
    SqlCommand command = new SqlCommand("sp_BatchInsert", connection);
    command.CommandType = CommandType.StoredProcedure;
    command.UpdatedRowSource = UpdateRowSource.None;

    // Set the Parameter with appropriate Source Column Name
    command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName);

    SqlDataAdapter adpt = new SqlDataAdapter();
    adpt.InsertCommand = command;
    // Specify the number of records to be Inserted/Updated in one go. Default is 1.
    adpt.UpdateBatchSize = 1000;

    connection.Open();
    int recordsInserted = adpt.Update(dtInsertRows);
    connection.Close();
}

So first thing is the batch size. Why would you set a batch size to anything but the number of records you are sending? Like I am sending 500,000 records so I did a Batch size of 500,000.

Next why does it crash when I do this? If I set it to 1000 for batch size it works just fine.

System.Data.SqlClient.SqlException was unhandled
  Message="A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)"
  Source=".Net SqlClient Data Provider"
  ErrorCode=-2146232060
  Class=20
  LineNumber=0
  Number=233
  Server=""
  State=0
  StackTrace:
       at System.Data.Common.DbDataAdapter.UpdatedRowStatusErrors(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
       at System.Data.Common.DbDataAdapter.UpdatedRowStatus(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
       at System.Data.Common.DbDataAdapter.Update(DataRow[] dataRows, DataTableMapping tableMapping)
       at System.Data.Common.DbDataAdapter.UpdateFromDataTable(DataTable dataTable, DataTableMapping tableMapping)
       at System.Data.Common.DbDataAdapter.Update(DataTable dataTable)
       at TestIQueryable.Program.BatchInsert() in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 124
       at TestIQueryable.Program.Main(String[] args) in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 16
  InnerException: 

Time it took to insert 500,000 records with insert batch size of 1000 took "2 mins and 54 seconds"

Of course this is no official time I sat there with a stop watch( I am sure there are better ways but was too lazy to look what they where)

So I find that kinda slow compared to all my other ones(expect the linq to sql insert one) and I am not really sure why.

Next I looked at bulkcopy

/// <summary>
/// An ado.net 2.0 way to mass insert records. This seems to be the fastest.
/// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
/// </summary>
private static void BatchBulkCopy()
{
    // Get the DataTable 
    DataTable dtInsertRows = GetDataTable();

    using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
    {
        sbc.DestinationTableName = "TBL_TEST_TEST";

        // Number of records to be processed in one go
        sbc.BatchSize = 500000;

        // Map the Source Column from DataTabel to the Destination Columns in SQL Server 2005 Person Table
        // sbc.ColumnMappings.Add("ID", "ID");
        sbc.ColumnMappings.Add("NAME", "NAME");

        // Number of records after which client has to be notified about its status
        sbc.NotifyAfter = dtInsertRows.Rows.Count;

        // Event that gets fired when NotifyAfter number of records are processed.
        sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);

        // Finally write to server
        sbc.WriteToServer(dtInsertRows);
        sbc.Close();
    }

}

This one seemed to go really fast and did not even need a SP( can you use SP with bulk copy? If you can would it be better?)

BatchCopy had no problem with a 500,000 batch size.So again why make it smaller then the number of records you want to send?

I found that with BatchCopy and 500,000 batch size it took only 5 seconds to complete. I then tried with a batch size of 1,000 and it only took 8 seconds.

So much faster then the bulkinsert one above.

Now I tried the other tutorial.

USE [Test]
GO
/****** Object:  StoredProcedure [dbo].[spTEST_InsertXMLTEST_TEST]    Script Date: 05/19/2010 15:39:03 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spTEST_InsertXMLTEST_TEST](@UpdatedProdData nText)
AS 
 DECLARE @hDoc int   

 exec sp_xml_preparedocument @hDoc OUTPUT,@UpdatedProdData 

 INSERT INTO TBL_TEST_TEST(NAME)
 SELECT XMLProdTable.NAME
    FROM OPENXML(@hDoc, 'ArrayOfTBL_TEST_TEST/TBL_TEST_TEST', 2)   
       WITH (
                ID Int,                 
                NAME varchar(100)
            ) XMLProdTable

EXEC sp_xml_removedocument @hDoc

C# code.

/// <summary>
/// This is using linq to sql to make the table objects. 
/// It is then serailzed to to an xml document and sent to a stored proedure
/// that then does a bulk insert(I think with OpenXML)
///  http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
/// </summary>
private static void LinqInsertXMLBatch()
{
    using (TestDataContext db = new TestDataContext())
    {
        TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000];
        for (int count = 0; count < 500000; count++)
        {
            TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
            testRecord.NAME = "Name : " + count;
            testRecords[count] = testRecord;
        }

        StringBuilder sBuilder = new StringBuilder();
        System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder);
        XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[]));
        serializer.Serialize(sWriter, testRecords);
        db.insertTestData(sBuilder.ToString());
    }
}

So I like this because I get to use objects even though it is kinda redundant. I don't get how the SP works. Like I don't get the whole thing. I don't know if OPENXML has some batch insert under the hood but I do not even know how to take this example SP and change it to fit my tables since like I said I don't know what is going on.

I also don't know what would happen if the object you have more tables in it. Like say I have a ProductName table what has a relationship to a Product table or something like that.

In linq to sql you could get the product name object and make changes to the Product table in that same object. So I am not sure how to take that into account. I am not sure if I would have to do separate inserts or what.

The time was pretty good for 500,000 records it took 52 seconds

The last way of course was just using linq to do it all and it was pretty bad.

/// <summary>
/// This is using linq to sql to to insert lots of records. 
/// This way is slow as it uses no mass insert.
/// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records.
/// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
/// </summary>
private static void LinqInsertAll()
{
    using (TestDataContext db = new TestDataContext())
    {
        db.CommandTimeout = 600;
        for (int count = 0; count < 50000; count++)
        {
            TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
            testRecord.NAME = "Name : " + count;
            db.TBL_TEST_TESTs.InsertOnSubmit(testRecord);
        }
        db.SubmitChanges();
    }
}

I did only 50,000 records and that took over a minute to do.

So I really narrowed it done to the linq to sql bulk insert way or bulk copy. I am just not sure how to do it when you have relationship for either way. I am not sure how they both stand up when doing updates instead of inserts as I have not gotten around to try it yet.

I don't think I will ever need to insert/update more than 50,000 records at one type but at the same time I know I will have to do validation on records before inserting so that will slow it down and that sort of makes linq to sql nicer as your got objects especially if your first parsing data from a xml file before you insert into the database.

Full C# code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Serialization;
using System.Data;
using System.Data.SqlClient;

namespace TestIQueryable
{
    class Program
    {
        private static string connectionString = "";
        static void Main(string[] args)
        {
            BatchInsert();
            Console.WriteLine("done");
        }

        /// <summary>
        /// This is using linq to sql to to insert lots of records. 
        /// This way is slow as it uses no mass insert.
        /// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records.
        /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
        /// </summary>
        private static void LinqInsertAll()
        {
            using (TestDataContext db = new TestDataContext())
            {
                db.CommandTimeout = 600;
                for (int count = 0; count < 50000; count++)
                {
                    TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                    testRecord.NAME = "Name : " + count;
                    db.TBL_TEST_TESTs.InsertOnSubmit(testRecord);
                }
                db.SubmitChanges();
            }
        }

        /// <summary>
        /// This is using linq to sql to make the table objects. 
        /// It is then serailzed to to an xml document and sent to a stored proedure
        /// that then does a bulk insert(I think with OpenXML)
        ///  http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
        /// </summary>
        private static void LinqInsertXMLBatch()
        {
            using (TestDataContext db = new TestDataContext())
            {
                TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000];
                for (int count = 0; count < 500000; count++)
                {
                    TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                    testRecord.NAME = "Name : " + count;
                    testRecords[count] = testRecord;
                }

                StringBuilder sBuilder = new StringBuilder();
                System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder);
                XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[]));
                serializer.Serialize(sWriter, testRecords);
                db.insertTestData(sBuilder.ToString());
            }
        }

        /// <summary>
        /// An ado.net 2.0 way to mass insert records. This seems to be the fastest.
        /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
        /// </summary>
        private static void BatchBulkCopy()
        {
            // Get the DataTable 
            DataTable dtInsertRows = GetDataTable();

            using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
            {
                sbc.DestinationTableName = "TBL_TEST_TEST";

                // Number of records to be processed in one go
                sbc.BatchSize = 500000;

                // Map the Source Column from DataTabel to the Destination Columns in SQL Server 2005 Person Table
                // sbc.ColumnMappings.Add("ID", "ID");
                sbc.ColumnMappings.Add("NAME", "NAME");

                // Number of records after which client has to be notified about its status
                sbc.NotifyAfter = dtInsertRows.Rows.Count;

                // Event that gets fired when NotifyAfter number of records are processed.
                sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);

                // Finally write to server
                sbc.WriteToServer(dtInsertRows);
                sbc.Close();
            }

        }


        /// <summary>
        /// Another ado.net 2.0 way that uses a stored procedure to do a bulk insert.
        /// Seems slower then "BatchBulkCopy" way and it crashes when you try to insert 500,000 records in one go.
        /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
        /// </summary>
        private static void BatchInsert()
        {
            // Get the DataTable with Rows State as RowState.Added
            DataTable dtInsertRows = GetDataTable();

            SqlConnection connection = new SqlConnection(connectionString);
            SqlCommand command = new SqlCommand("sp_BatchInsert", connection);
            command.CommandType = CommandType.StoredProcedure;
            command.UpdatedRowSource = UpdateRowSource.None;

            // Set the Parameter with appropriate Source Column Name
            command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName);

            SqlDataAdapter adpt = new SqlDataAdapter();
            adpt.InsertCommand = command;
            // Specify the number of records to be Inserted/Updated in one go. Default is 1.
            adpt.UpdateBatchSize = 500000;

            connection.Open();
            int recordsInserted = adpt.Update(dtInsertRows);
            connection.Close();
        }



        private static DataTable GetDataTable()
        {
            // You First need a DataTable and have all the insert values in it
            DataTable dtInsertRows = new DataTable();
            dtInsertRows.Columns.Add("NAME");

            for (int i = 0; i < 500000; i++)
            {
                DataRow drInsertRow = dtInsertRows.NewRow();
                string name = "Name : " + i;
                drInsertRow["NAME"] = name;
                dtInsertRows.Rows.Add(drInsertRow);


            }
            return dtInsertRows;

        }


        static void sbc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
        {
            Console.WriteLine("Number of records affected : " + e.RowsCopied.ToString());
        }


    }
}
like image 630
chobo2 Avatar asked May 19 '10 22:05

chobo2


1 Answers

The batch size is there to reduce the impact of network latency. It does not need to be more than a few thousand. Multiple statements are collected together and sent as a unit, so you get the hit of one network trip once every N statements, rather than once per statement.

like image 188
mdma Avatar answered Oct 20 '22 00:10

mdma