Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Effecient way to compare data tables

I have below c# method to compare two data tables and return the mismatch records.

public DataTable GetTableDiff(DataTable dt1, DataTable dt2, string columnName)
{
    var StartTime = DateTime.Now;
    dt1.PrimaryKey = new DataColumn[] { dt1.Columns["N"] };
    dt2.PrimaryKey = new DataColumn[] { dt2.Columns["N"] };

    DataTable dtDifference = null;
    //Get the difference of two datatables
    var dr = from r in dt1.AsEnumerable()
             where !dt2.AsEnumerable().Any(r2 => r["N"].ToString().Trim().ToLower() == r2["N"].ToString().Trim().ToLower()
                 && r[columnName].ToString().Trim().ToLower() == r2[columnName].ToString().Trim().ToLower())
             select r;

    if (dr.Any())
    {
        dtDifference = dr.CopyToDataTable();
    }
    return dtDifference;
}

This code works, but it takes 1.24 minutes to compare 10,000 records in the datatable. Any way to make this faster?

N is the primary key and columnName is the column to compare.

Thanks.

like image 317
user1447718 Avatar asked May 07 '20 04:05

user1447718


People also ask

Which table is used for comparison of data?

The Comparison Table of SeqSphere+ is used to compare and visualize Sample data. Each row of the table represents one Sample and each column represents a metadata field.

How do I compare data in two tables in Excel?

Compare 2 Excel workbooks Open the workbooks you want to compare. Go to the View tab, Window group, and click the View Side by Side button. That's it!


1 Answers

First I would ask if you have tried this in a simple for/foreach loop instead and compared the performance?

At the moment you are creating a new Enumerable and then copying to a datatable. If you use a for/foreach loop then you can compare and copy in the same iteration.

You should also look at the string comparison. At the moment you are trimming then converting to lowercase. This will allocate new memory for each operation for each string as strings are immutable. So in your where statement you are basically doing this (up to) 8 times per iteration.

I would also ask if you really need Trim()? Is it likely that one DT will have a space at the front of the string and the other not? Or will a comparison still be true? Don't trim strings unless really needed.

Then you should use case insensitive string comparison rather than converting ToLower. This will be quicker. According to MS StringComparison.OrdinalIgnoreCase is better performing.

Do these and then compare performance and see how much difference you have

See also: https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings

Update:

This intrigued me, so I went back and done some tests. I generated 10,000 rows of random(ish) data in two datatables where every second row would match and executed your comparison vs a simplified for loop comparison with a String comparison like this:

  for (int i = 0; i < dt1.Rows.Count; i++)
  {
      if (dt1.Rows[i]["N"].ToString().Equals(dt2.Rows[i]["N"].ToString(), StringComparison.OrdinalIgnoreCase)
          && dt1.Rows[i][columnName].ToString().Equals(dt2.Rows[i][columnName].ToString(), StringComparison.OrdinalIgnoreCase))
      {
          dtDifference.Rows.Add(dt1.Rows[i].ItemArray);
      }
  }

Your code = 66,000ms -> 75,000ms

For loop code = 12ms -> 20ms

A significant difference!

Then I did a comparison using the for loop method but with the two different string comparison types for the string. Using my string comparison, vs yours. But I had to test on 1 million rows for this, to get a significant difference.

This differend by between 200ms and 800ms

So it seems in this case that the string comparison is not a major factor.

So it seems that your Linq query creating the datarows is what is taking the majority of time and not the comparison of the rows themselves.

So switch to using the for loop, and all will be well in the world again!

like image 199
jason.kaisersmith Avatar answered Sep 23 '22 18:09

jason.kaisersmith