Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare 2 unordered recordset in memory

Below is my application database table which contains SQL queries stored in a table: QueryStorage

Id           Query           ConnectionString       Rdbms
1            select...        Data Source           Sql Server
2            select...        Data Source           Oracle

The SQL queries in the above table are updated through web service and we are not allowed to update above the queries, though we can add something on top of the query something like this:

Query stored in table: select id as LinkedColumn, Amount as CompareColumn from Source

Tweaked query from my c# app:select Q.LinkedColumn, Q.CompareColumn from (stored sql query) as Q

I am trying to compare 2 unordered list like below:

Query executed for Id = 1(Sql server) from QueryStorage table records are like below:

select Id as LinkedColumn,CompareColumn from Source

List 1:

LinkedColumn     CompareColumn
1                100
2                200
3                300
4                400
5                500
6                600
7                700
8                800
9                900
10               1000

Query executed for Id = 2(Oracle) from QueryStorage table records are like below:

select Id as LinkedColumn,CompareColumn from Target

List 2:

LinkedColumn       CompareColumn
10                 10
9                  20
8                  30
7                  40
6                  50
5                  60
4                  70
3                  80
2                  90
1                  5

I want to join on LinkedColumn from source to target and then do comparison on CompareColumn which should give me following output:

SrcLinkedColumn      SrcCompareColumn     TgtLinkedColumn    TgtCompareColumn
    1                       100           1                  5
    2                       200           2                  90               

Logic:

var data = (from s in List1.AsEnumerable()
             join t in List2.AsEnumerable() on s.Field<string>("LinkedColumn") equals t.Field<string>("LinkedColumn")
             where s.Field<decimal>("CompareColumn") != t.Field<decimal>("CompareColumn")
                        select new
                        {
                            srcLinkedcol = s.Field<string>("LinkedColumn"),
                            srcCompareCol = s.Field<decimal>("CompareColumn"),
                            tgtLinkedCol = t.Field<string>("LinkedColumn"),
                            tgtCompareCol = t.Field<decimal>("CompareColumn")
                        }).ToList();

There will be millions of records from source to target which I want to compare with so big issue is of out of memory exception which we are facing right now by loading all data in memory and then doing comparison with above linq query.

There are 2 solutions which I have thought of like below:

1) Open 2 ordered data reader.

Pros :

    - No memory exception
    - Fast as there will be 1 to 1 comparision of LinkedColumn for List1 and 
      List2 records.

Cons :

    - Order by is require on LinkedColumns and as i have no control over 
      query as because it is dumped by webservice in QueryStorage table so 
      user is explicitly require to submit query with order by on 
      LinkedColumn.
    - Wont work if order by on Linkedcolumn is not present.
    - Order by query have performance overhead so because of this user may not include order by on LinkedColumn in query.                               

2) Compare chunk by chunk records by tweaking query and adding OffSet and FetchRowNext. This is how I am thinking for algorithm:

enter image description here

But i still feel that with second approach i can get memory exception issue because at some steps where data from source and target are not matching i will be storing them inside buffer(datatable or list etc..) for next chunk comparision.

Can anybody please guide me what should be the good algorithm for this or any better way to address this?

Note : I dont want to use LinkedServer and SSIS.

like image 789
ILoveStackoverflow Avatar asked Feb 02 '18 12:02

ILoveStackoverflow


2 Answers

Your second solution looks like an attempt to re-invent the External merge sort. It is a valid approach if your data doesn't fit into memory, but it is overkill when your datasets fit into memory.

24 million rows with two numbers each (8 bytes) is only ~200MB of memory. Two datasets are 400MB. This is nothing on the modern desktop hardware.

Load both datasets in memory. In simple arrays. Sort arrays in memory. Find the difference by scanning both sorted arrays once. You don't need fancy LINQ for that.

Here is pseudocode. We have Array1 and Array2. Maintain two variables that contain the current index of each array: Idx1 and Idx2. Move the indexes along the arrays looking for places where LinkedColumn match.

Idx1 = 0; Idx2 = 0;
while (true)
{
    // scan arrays until both indexes point to the same LinkedColumn
    // here we assume that both arrays are sorted by LinkedColumn
    // and that values in LinkedColumn are unique

    while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
        Array1[Idx1].LinkedColumn < Array2[Idx2].LinkedColumn)
    {
        // move along Array1
        Idx1++;
    }

    while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
        Array1[Idx1].LinkedColumn > Array2[Idx2].LinkedColumn)
    {
        // move along Array2
        Idx2++;
    }

    // at this point both indexes point to the same LinkedColumn
    // or one or both of the arrays are over

    if (Idx1 >= Array1.Length || Idx2 >= Array2.Length)
    {
        break;
    }

    // compare the values
    if (Array1[Idx1].CompareColumn != Array2[Idx2].CompareColumn)
    {
        // TODO: output/save/print the difference
    }

    Idx1++; Idx2++;
}

OR

you can dump both datasets in a database of your choice in two tables T1 and T2, create unique index on LinkedColumn in both tables and run this query:

SELECT
     T1.LinkedColumn  AS SrcLinkedColumn
    ,T1.CompareColumn AS SrcCompareColumn
    ,T2.LinkedColumn  AS DstLinkedColumn
    ,T2.CompareColumn AS DstCompareColumn
FROM
    T1 INNER JOIN T2 ON T1.LinkedColumn = T2.LinkedColumn
WHERE
    T1.CompareColumn <> T2.CompareColumn
ORDER BY
    T1.LinkedColumn
;

The pseudocode above performs the same merge join as the DBMS server would perform for this query.

like image 73
Vladimir Baranov Avatar answered Sep 30 '22 07:09

Vladimir Baranov


If I understood correctly, you get tables data in C# by somethings like DataReader. If you store records that fetch from web service and then execute some query as @VladimirBaranov mentioned, it need too long time for storing and it not reasonable.

I think a better solution for comparing in this case is Binary Search. Time order of it is O(log n) and Memory order of it is O(1). So it doesn't need to extra memory.

You can use Binary search like below:

1- Sorting smaller table.

Time order: If this table has n element, on average, executing time is O(nlogn) (more information).

Memory order: O(1), because sort method in C# is In-place.

2- Comparing another table records with sorted table by Binary search.

Time order: If another table has m records, executing time is O(m)*O(logn) = O(mlogn). (more information)

Memory order: It needs no extra memory.

To conclusion, I thank may be best solution for your issue, but I think this solution is good solution that has O(mlogn) executing time, O(1) memory order and it implements only in C#, no need to save records in database.

like image 28
Ali Soltani Avatar answered Sep 30 '22 07:09

Ali Soltani