
How to quickly reduce a DataTable according to existing entries in a DB

I'm currently trying to bulk insert a DataTable into a database. It works fine and fast. The only problem occurs when there are rows that already exist in the database (duplicate key).

To counter this I have modified my program so that I first check, for each new entry, whether it already exists in the database or not. Which is.......slow (in the current cases I don't have many entries, but later on it's over 200k entries that I need to check, and that several times). Thus I need to make it faster than it is now (if possible).

The datatable is structured this way:

DataTable transactionTable = new DataTable();
transactionTable.Columns.Add("DeviceId", typeof(Int32));
transactionTable.Columns.Add("LogDate", typeof(DateTime));
transactionTable.Columns.Add("LogType", typeof(Int32));
transactionTable.Columns.Add("LogText", typeof(String));

transactionTable.PrimaryKey = new DataColumn[3] {
    transactionTable.Columns[0],
    transactionTable.Columns[1],
    transactionTable.Columns[2]
};

What I have so far is the following:

DataTable insertTable = transactionTable.Copy();
insertTable.Clear();
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
{
    sqlCon.Open();
    foreach (var entry in transactionTable.AsEnumerable())
    {
        using (SqlCommand sqlCom = sqlCon.CreateCommand())
        {
            sqlCom.Parameters.Clear();
            sqlCom.CommandText = "SELECT 1 FROM myTable WHERE"
                    + " DeviceId = @DeviceId AND LogDate = @LogDate"
                    + " AND LogType = @LogType";
            sqlCom.Parameters.AddWithValue("@DeviceId", entry.Field<Int32>("DeviceId"));
            sqlCom.Parameters.AddWithValue("@LogDate", entry.Field<DateTime>("LogDate"));
            sqlCom.Parameters.AddWithValue("@LogType", entry.Field<Int32>("LogType"));

            using (SqlDataReader myRead = sqlCom.ExecuteReader())
            {
                myRead.Read();

                if (myRead.HasRows == false)
                {
                    insertTable.Rows.Add(entry.ItemArray);
                }
            }

        }
    }
}

// And afterwards the bulk insert, which I think is out of scope for the question itself
// (I use insertTable there)

Now my question is: is there any way to do this faster so that I don't get the key violation problem?

Asked Oct 20 '15 by Thomas


3 Answers

In this case I would use a staging table. Here are the steps:

  1. Bulk insert into the staging table (using SqlBulkCopy)
  2. Insert into the base table using a stored proc with a LEFT JOIN to eliminate existing rows
  3. Truncate the staging table

So you will need to delete the foreach statement in your code, add a stored proc for inserting into the base table, and add a stored proc for truncating the staging table. Or you can combine the last two steps into one.
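
A rough C# sketch of these steps (the staging table name myTable_Staging and the stored procedure name usp_InsertFromStaging are placeholder names I'm assuming, not part of the original answer):

// Requires: using System.Data; using System.Data.SqlClient;
// Sketch only: "myTable_Staging" and "usp_InsertFromStaging" are made-up names.
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
{
    sqlCon.Open();

    // Step 1: bulk copy the whole DataTable into the staging table.
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlCon))
    {
        bulkCopy.DestinationTableName = "dbo.myTable_Staging";
        bulkCopy.WriteToServer(transactionTable);
    }

    // Steps 2 + 3 combined: one stored procedure that inserts only the new rows
    // (LEFT JOIN on DeviceId, LogDate, LogType and keep rows with no match)
    // and then truncates the staging table.
    using (SqlCommand sqlCom = sqlCon.CreateCommand())
    {
        sqlCom.CommandType = CommandType.StoredProcedure;
        sqlCom.CommandText = "dbo.usp_InsertFromStaging";
        sqlCom.ExecuteNonQuery();
    }
}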

Answered by Giorgi Nakeuri


I have a similar setup.

I'm using a stored procedure with a table-valued parameter and a MERGE statement. See also Table-Valued Parameters for an example of how to use them in .NET.

I would shift the focus of the problem from a simple bulk insert to merging a batch of rows into a table with existing data.

Destination table

CREATE TABLE [dbo].[MyTable](
    [DeviceId] [int] NOT NULL,
    [LogDate] [datetime] NOT NULL,
    [LogType] [int] NOT NULL,
    [LogText] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED 
(
    [DeviceId] ASC,
    [LogDate] ASC,
    [LogType] ASC
))

Create user-defined table type

CREATE TYPE [dbo].[MyTableType] AS TABLE(
    [DeviceId] [int] NOT NULL,
    [LogDate] [datetime] NOT NULL,
    [LogType] [int] NOT NULL,
    [LogText] [nvarchar](50) NOT NULL,
    PRIMARY KEY CLUSTERED 
(
    [DeviceId] ASC,
    [LogDate] ASC,
    [LogType] ASC
))

Test and measure whether specifying PRIMARY KEY for the TYPE makes the overall process faster or slower.

Stored procedure with TVP

CREATE PROCEDURE [dbo].[MergeMyTable]
    @ParamRows dbo.MyTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY

        MERGE INTO dbo.MyTable as Dest
        USING
        (
            SELECT
                TT.[DeviceId],
                TT.[LogDate],
                TT.[LogType],
                TT.[LogText]
            FROM
                @ParamRows AS TT
        ) AS Src
        ON 
            (Dest.[DeviceId] = Src.[DeviceId]) AND
            (Dest.[LogDate]  = Src.[LogDate]) AND
            (Dest.[LogType]  = Src.[LogType])
        WHEN MATCHED THEN 
        UPDATE SET 
            Dest.[LogText] = Src.[LogText]
        WHEN NOT MATCHED BY TARGET THEN 
        INSERT
            ([DeviceId]
            ,[LogDate]
            ,[LogType]
            ,[LogText])
        VALUES
            (Src.[DeviceId],
            Src.[LogDate],
            Src.[LogType],
            Src.[LogText]);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;

END

Call this stored procedure, passing it a batch of rows to merge. Test and measure how performance changes with the size of the batch. Try batches of 1K, 10K, and 100K rows.
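
For reference, calling the procedure from C# could look roughly like this (a minimal sketch; it passes the whole transactionTable from the question, and a smaller batch would be passed the same way):

// Requires: using System.Data; using System.Data.SqlClient;
// Column order in transactionTable matches dbo.MyTableType
// (DeviceId, LogDate, LogType, LogText).
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
using (SqlCommand sqlCom = new SqlCommand("dbo.MergeMyTable", sqlCon))
{
    sqlCom.CommandType = CommandType.StoredProcedure;

    // Pass the DataTable as the table-valued parameter.
    SqlParameter tvpParam = sqlCom.Parameters.AddWithValue("@ParamRows", transactionTable);
    tvpParam.SqlDbType = SqlDbType.Structured;
    tvpParam.TypeName = "dbo.MyTableType";

    sqlCon.Open();
    sqlCom.ExecuteNonQuery();
}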

If you never want to update existing rows with new values, remove the WHEN MATCHED THEN part of the MERGE; it will work faster.

Answered by Vladimir Baranov


You can drop and recreate your primary key index with the IGNORE_DUP_KEY option set to ON. Something like this:

ALTER TABLE datatable
ADD CONSTRAINT PK_datatable 
PRIMARY KEY CLUSTERED (DeviceId,LogDate,LogType,LogText) 
WITH (IGNORE_DUP_KEY = ON)

What this option does is report a duplicate key error with a different severity and message when any duplicate inserts for the index are attempted. It will not allow duplicates to be entered, but it will continue to insert all the records that are not duplicates and only give a warning message if duplicates were found and ignored.

More info at this link: Creating Unique Indexes.
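
With that option in place, the per-row existence check from the question is no longer needed; here is a minimal sketch of the simplified flow (assuming the destination table from the question):

// Requires: using System.Data.SqlClient;
// Duplicates are skipped server-side (with a warning) by IGNORE_DUP_KEY,
// so the whole DataTable can be bulk copied directly.
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlCon))
{
    sqlCon.Open();
    bulkCopy.DestinationTableName = "dbo.myTable";
    bulkCopy.WriteToServer(transactionTable);
}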

Answered by Brian Pressler