Below is my application database table which contains SQL queries stored in a table: QueryStorage
Id Query ConnectionString Rdbms
1 select... Data Source Sql Server
2 select... Data Source Oracle
The SQL queries in the above table are updated through web service and we are not allowed to update above the queries, though we can add something on top of the query something like this:
Query stored in table: select id as LinkedColumn, Amount as CompareColumn from Source
Tweaked query from my c# app:select Q.LinkedColumn, Q.CompareColumn from (stored sql query) as Q
I am trying to compare 2 unordered list like below:
Query executed for Id = 1(Sql server)
from QueryStorage table records are like below:
select Id as LinkedColumn,CompareColumn from Source
List 1:
LinkedColumn CompareColumn
1 100
2 200
3 300
4 400
5 500
6 600
7 700
8 800
9 900
10 1000
Query executed for Id = 2(Oracle)
from QueryStorage table records are like below:
select Id as LinkedColumn,CompareColumn from Target
List 2:
LinkedColumn CompareColumn
10 10
9 20
8 30
7 40
6 50
5 60
4 70
3 80
2 90
1 5
I want to join on LinkedColumn from source to target
and then do comparison on CompareColumn
which should give me following output:
SrcLinkedColumn SrcCompareColumn TgtLinkedColumn TgtCompareColumn
1 100 1 5
2 200 2 90
Logic:
var data = (from s in List1.AsEnumerable()
join t in List2.AsEnumerable() on s.Field<string>("LinkedColumn") equals t.Field<string>("LinkedColumn")
where s.Field<decimal>("CompareColumn") != t.Field<decimal>("CompareColumn")
select new
{
srcLinkedcol = s.Field<string>("LinkedColumn"),
srcCompareCol = s.Field<decimal>("CompareColumn"),
tgtLinkedCol = t.Field<string>("LinkedColumn"),
tgtCompareCol = t.Field<decimal>("CompareColumn")
}).ToList();
There will be millions of records from source to target which I want to compare with so big issue is of out of memory exception
which we are facing right now by loading all data in memory and then doing comparison with above linq query.
There are 2 solutions which I have thought of like below:
1) Open 2 ordered data reader.
Pros :
- No memory exception
- Fast as there will be 1 to 1 comparision of LinkedColumn for List1 and
List2 records.
Cons :
- Order by is require on LinkedColumns and as i have no control over
query as because it is dumped by webservice in QueryStorage table so
user is explicitly require to submit query with order by on
LinkedColumn.
- Wont work if order by on Linkedcolumn is not present.
- Order by query have performance overhead so because of this user may not include order by on LinkedColumn in query.
2) Compare chunk by chunk records by tweaking query and adding OffSet and FetchRowNext
. This is how I am thinking for algorithm:
But i still feel that with second approach i can get memory exception issue because at some steps where data from source and target are not matching i will be storing them inside buffer(datatable or list etc..) for next chunk comparision.
Can anybody please guide me what should be the good algorithm for this or any better way to address this?
Note : I dont want to use LinkedServer and SSIS.
Your second solution looks like an attempt to re-invent the External merge sort. It is a valid approach if your data doesn't fit into memory, but it is overkill when your datasets fit into memory.
24 million rows with two numbers each (8 bytes) is only ~200MB of memory. Two datasets are 400MB. This is nothing on the modern desktop hardware.
Load both datasets in memory. In simple arrays. Sort arrays in memory. Find the difference by scanning both sorted arrays once. You don't need fancy LINQ for that.
Here is pseudocode. We have Array1
and Array2
. Maintain two variables that contain the current index of each array: Idx1
and Idx2
. Move the indexes along the arrays looking for places where LinkedColumn
match.
Idx1 = 0; Idx2 = 0;
while (true)
{
// scan arrays until both indexes point to the same LinkedColumn
// here we assume that both arrays are sorted by LinkedColumn
// and that values in LinkedColumn are unique
while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
Array1[Idx1].LinkedColumn < Array2[Idx2].LinkedColumn)
{
// move along Array1
Idx1++;
}
while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
Array1[Idx1].LinkedColumn > Array2[Idx2].LinkedColumn)
{
// move along Array2
Idx2++;
}
// at this point both indexes point to the same LinkedColumn
// or one or both of the arrays are over
if (Idx1 >= Array1.Length || Idx2 >= Array2.Length)
{
break;
}
// compare the values
if (Array1[Idx1].CompareColumn != Array2[Idx2].CompareColumn)
{
// TODO: output/save/print the difference
}
Idx1++; Idx2++;
}
you can dump both datasets in a database of your choice in two tables T1
and T2
, create unique index on LinkedColumn
in both tables and run this query:
SELECT
T1.LinkedColumn AS SrcLinkedColumn
,T1.CompareColumn AS SrcCompareColumn
,T2.LinkedColumn AS DstLinkedColumn
,T2.CompareColumn AS DstCompareColumn
FROM
T1 INNER JOIN T2 ON T1.LinkedColumn = T2.LinkedColumn
WHERE
T1.CompareColumn <> T2.CompareColumn
ORDER BY
T1.LinkedColumn
;
The pseudocode above performs the same merge join as the DBMS server would perform for this query.
If I understood correctly, you get tables data in C#
by somethings like DataReader
. If you store records that fetch from web service and then execute some query as @VladimirBaranov mentioned, it need too long time for storing and it not reasonable.
I think a better solution for comparing in this case is Binary Search. Time order of it is O(log n)
and Memory order of it is O(1). So it doesn't need to extra memory.
You can use Binary search like below:
1- Sorting smaller table.
Time order: If this table has n element, on average, executing time is O(nlogn)
(more information).
Memory order: O(1), because sort
method in C# is In-place.
2- Comparing another table records with sorted table by Binary search.
Time order: If another table has m records, executing time is O(m)*O(logn) = O(mlogn)
.
(more information)
Memory order: It needs no extra memory.
To conclusion, I thank may be best solution for your issue, but I think this solution is good solution that has O(mlogn)
executing time, O(1)
memory order and it implements only in C#, no need to save records in database.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With