I have the C# method below to compare two DataTables and return the mismatched records.
public DataTable GetTableDiff(DataTable dt1, DataTable dt2, string columnName)
{
    var StartTime = DateTime.Now;
    dt1.PrimaryKey = new DataColumn[] { dt1.Columns["N"] };
    dt2.PrimaryKey = new DataColumn[] { dt2.Columns["N"] };
    DataTable dtDifference = null;

    // Get the difference of the two datatables
    var dr = from r in dt1.AsEnumerable()
             where !dt2.AsEnumerable().Any(r2 =>
                    r["N"].ToString().Trim().ToLower() == r2["N"].ToString().Trim().ToLower()
                 && r[columnName].ToString().Trim().ToLower() == r2[columnName].ToString().Trim().ToLower())
             select r;

    if (dr.Any())
    {
        dtDifference = dr.CopyToDataTable();
    }
    return dtDifference;
}
This code works, but it takes 1.24 minutes to compare 10,000 records. Is there any way to make it faster? N is the primary key and columnName is the column to compare.
Thanks.
First, I would ask whether you have tried a simple for/foreach loop instead and compared the performance.
At the moment you are creating a new enumerable and then copying it to a DataTable. With a for/foreach loop you can compare and copy in the same iteration.
You should also look at the string comparison. At the moment you are trimming and then converting to lowercase. Because strings are immutable, each of those operations allocates new memory, so your where clause is doing this (up to) 8 times per iteration.
I would also ask whether you really need Trim(). Is it likely that one DataTable will have a space at the front of the string and the other not? Or should such a comparison still be true? Don't trim strings unless it is really needed.
Then use a case-insensitive string comparison rather than converting with ToLower(); this will be quicker. According to Microsoft, StringComparison.OrdinalIgnoreCase performs best.
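As a minimal sketch (the sample strings are just assumptions), the difference between the two styles looks like this:

```csharp
using System;

class StringCompareDemo
{
    static void Main()
    {
        string a = " Widget-42 ";
        string b = "WIDGET-42";

        // Allocates up to four new strings per check (two Trims, two ToLowers):
        bool slow = a.Trim().ToLower() == b.Trim().ToLower();

        // Trims once each, then compares in place with no lowercased copies:
        bool fast = string.Equals(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase);

        Console.WriteLine(slow == fast); // both paths agree here
    }
}
```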
Do these and then compare the performance to see how much difference it makes.
See also: https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings
Update:
This intrigued me, so I went back and did some tests. I generated 10,000 rows of random(ish) data in two DataTables where every second row would match, and executed your comparison against a simplified for loop with a string comparison like this:
// Assumes both tables have the same number of rows in the same order.
var dtDifference = dt1.Clone();
for (int i = 0; i < dt1.Rows.Count; i++)
{
    // Collect the row when either column does NOT match.
    if (!(dt1.Rows[i]["N"].ToString().Equals(dt2.Rows[i]["N"].ToString(), StringComparison.OrdinalIgnoreCase)
        && dt1.Rows[i][columnName].ToString().Equals(dt2.Rows[i][columnName].ToString(), StringComparison.OrdinalIgnoreCase)))
    {
        dtDifference.Rows.Add(dt1.Rows[i].ItemArray);
    }
}
Your code: 66,000 ms -> 75,000 ms
For loop code: 12 ms -> 20 ms
A significant difference!
Then I compared the for loop method with the two different string-comparison styles, mine vs. yours, though I had to test on 1 million rows to get a significant difference. The results differed by between 200 ms and 800 ms, so in this case the string comparison is not a major factor. It is your LINQ query creating the DataRows that takes the majority of the time, not the comparison of the rows themselves.
So switch to the for loop, and all will be well in the world again!
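If the two tables are not guaranteed to be row-aligned (the for loop compares row i to row i), a hashed lookup keeps the single-pass speed while preserving your original "row from dt1 not present in dt2" semantics. A sketch, assuming the "|" separator never occurs in the data (use a tuple key otherwise):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

class DiffDemo
{
    static DataTable GetTableDiffFast(DataTable dt1, DataTable dt2, string columnName)
    {
        // Build the lookup set from dt2 once; the case-insensitive comparer
        // replaces the per-check ToLower() calls.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (DataRow r2 in dt2.Rows)
            seen.Add(r2["N"].ToString().Trim() + "|" + r2[columnName].ToString().Trim());

        // One pass over dt1 with an O(1) lookup per row instead of scanning dt2.
        var diff = dt1.Clone();
        foreach (DataRow r in dt1.Rows)
            if (!seen.Contains(r["N"].ToString().Trim() + "|" + r[columnName].ToString().Trim()))
                diff.Rows.Add(r.ItemArray);
        return diff;
    }

    static void Main()
    {
        var dt1 = new DataTable();
        dt1.Columns.Add("N");
        dt1.Columns.Add("Val");
        dt1.Rows.Add("1", "apple");
        dt1.Rows.Add("2", "banana");

        var dt2 = dt1.Clone();
        dt2.Rows.Add("1", "APPLE");   // matches case-insensitively
        dt2.Rows.Add("2", "cherry");  // value differs

        var diff = GetTableDiffFast(dt1, dt2, "Val");
        Console.WriteLine(diff.Rows.Count); // 1 (only the "banana" row mismatches)
    }
}
```

This turns the O(n²) nested scan into roughly O(n), which matters far more at 10,000+ rows than the string-comparison style does.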