Determine duplicates based off minimum N characters from smaller comparing string

Question

I have two lists, both containing models that share a common field, ID (a string value). I am comparing the IDs to find duplicates.

I currently have a LINQ statement in place to determine the duplicated ID values, which stores them into a list of strings:

List<string> duplicateRecords = testData.TestRecords.GroupBy(aa => aa.ID).Where(x => x.Count() > 1).Select(y => y.Key).ToList();

And a second LINQ statement that builds a list of the respective models from the duplicated IDs found above:

List<Model> modelRecords = testData.Models.Where(x => duplicateRecords.Any(y => x.ID == y)).ToList();

These two LINQ statements do exactly what I expect, which is great. But there is now a request to determine duplicate IDs based on a minimum of N characters during the comparison. The comparison must apply to the last N characters of each string.

Examples:

ID1: 123 == ID2: 123
ID1: 0123 == ID2: 123
ID1: 123 == ID2: 0123
ID1: 1230 != ID2: 123
ID1: 123 != ID2: 1230
ID1: 122110123 == ID2: 123

Hopefully those examples give some insight into the problem I am trying to solve. This could be done with foreach loops, but in my experience the code becomes messy and unmanageable for complex list queries.

So my question is this: How can I use the last N characters of the shorter of the two strings being compared to determine duplicates using LINQ?

Note: I am also very open to more elegant ways of solving this problem, but I would really appreciate solutions that avoid for or foreach loops.

Piotr L · Accepted Answer

I assume that when the input contains 123 and 0123, you want the result to include both of them:

var input = new List<Model>()
{
    new Model {ID = "123"},
    new Model {ID = "0123"},
    new Model {ID = "1230"},
    new Model {ID = "12"},
    new Model {ID = "122110123"}
};

var result = input.Where(x => input.Any(y => y != x && (y.ID.EndsWith(x.ID) || x.ID.EndsWith(y.ID)))).ToList();
// this will return 123, 0123 and 122110123

If you want to check against the existing duplicateRecords list, then this should work:

List<Model> modelRecords = testData.Models.Where(x => duplicateRecords.Any(y => x.ID.EndsWith(y) || y.EndsWith(x.ID))).ToList();
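Both queries above test EndsWith in both directions. As a minimal sketch, that test could be pulled into a helper that compares ordinally (the single-argument string.EndsWith(string) overload is culture-sensitive) and requires a minimum suffix length. The names SuffixMatch, IsSuffixDuplicate, and minLength are hypothetical, and treating "minimum N" as a lower bound on the length of the shorter ID is an assumption about the question, not something the answer states.

```csharp
using System;

public static class SuffixMatch
{
    // Hypothetical helper: true when the shorter string is a suffix of the
    // longer one and the shorter string has at least minLength characters
    // (the assumed meaning of "minimum N characters").
    public static bool IsSuffixDuplicate(string a, string b, int minLength = 1)
    {
        if (a == null || b == null) return false;

        // Pick the shorter and longer of the two inputs.
        string shorter = a.Length <= b.Length ? a : b;
        string longer = a.Length <= b.Length ? b : a;

        // Too short to count as a match under the assumed minimum-N rule.
        if (shorter.Length < minLength) return false;

        // Ordinal comparison avoids culture-specific matching surprises.
        return longer.EndsWith(shorter, StringComparison.Ordinal);
    }
}
```

It would drop into the second query as duplicateRecords.Any(y => SuffixMatch.IsSuffixDuplicate(x.ID, y, 3)). With minLength of at least 1, an empty ID never matches, whereas a plain EndsWith("") is true for every string.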

NetMage · Answer

In order to find the duplicates efficiently, you need to sort the IDs by length so you can minimize the number of comparisons. (The sort adds some overhead, but greatly decreases the comparisons that must be done - in my test where 9 IDs have and 3 are duplicates of 8 values, it is 15 comparisons sorted versus 42 unsorted.) Once they are sorted by length, compare each ID to the ones of equal or greater length (equal length covers complete duplicates) to find which short IDs need to be kept, marking any matches so you can skip them later. Then find all the Models whose ID ends with one of the kept short IDs.

Create the List of IDs ordered by their length:

var orderedIDs = testData.TestRecords.Select(tr => tr.ID).OrderBy(id => id.Length).ToList();

I don't think there is any way to do this efficiently with LINQ, but a nested for loop skipping previous matches optimizes the search for duplicates.

First, variables to keep track of IDs and which IDs have already matched:

var dupRecordSubIDs = new List<string>();
var alreadyMatched = new bool[testData.TestRecords.Count];

Now loop through the IDs and save the shorter matching IDs:

// foreach ID in length order
for (int n1 = 0; n1 < testData.TestRecords.Count-1; ++n1) {
    // skip the ones that already matched a shorter ID
    if (!alreadyMatched[n1]) {
        // remember if the shorter ID was already added
        var added_n1 = false;
        // compare the ID to all greater than or equal length IDs
        for (int n2 = n1 + 1; n2 < testData.TestRecords.Count; ++n2) {
            // if not previously matched, see if we have a new match
            if (!alreadyMatched[n2] && orderedIDs[n2].EndsWith(orderedIDs[n1])) {
                // only add the shorter ID once for new matches
                if (!added_n1) {
                    dupRecordSubIDs.Add(orderedIDs[n1]);
                    added_n1 = true;
                }
                // remember which longer IDs are already matched
                alreadyMatched[n2] = true;
            }
        }
    }
}

Now find all the Models that match one of the IDs with a duplicate:

var modelRecords = testData.Models.Where(m => dupRecordSubIDs.Any(d => m.ID.EndsWith(d))).ToList();
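For reference, the sort and the nested loop above can be gathered into a single method, sketched below. The class name SuffixDuplicates and method name FindDuplicateSubIds are hypothetical, and the use of StringComparison.Ordinal is an added assumption. The snippets above call EndsWith without a comparison argument, which is culture-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuffixDuplicates
{
    // Hypothetical wrapper: returns each shorter ID that at least one other
    // ID of equal or greater length ends with.
    public static List<string> FindDuplicateSubIds(IEnumerable<string> ids)
    {
        // Sort by length so each ID is only compared against equal or longer ones.
        var ordered = ids.OrderBy(id => id.Length).ToList();
        var alreadyMatched = new bool[ordered.Count];
        var result = new List<string>();

        for (int n1 = 0; n1 < ordered.Count - 1; ++n1)
        {
            // Skip IDs that already matched a shorter ID.
            if (alreadyMatched[n1]) continue;

            bool added = false;
            for (int n2 = n1 + 1; n2 < ordered.Count; ++n2)
            {
                if (!alreadyMatched[n2] &&
                    ordered[n2].EndsWith(ordered[n1], StringComparison.Ordinal))
                {
                    // Record the shorter ID only once, however many IDs end with it.
                    if (!added)
                    {
                        result.Add(ordered[n1]);
                        added = true;
                    }
                    alreadyMatched[n2] = true;
                }
            }
        }
        return result;
    }
}
```

For the IDs 123, 0123, 1230, 12 and 122110123 this returns a list containing only 123, and the final Where query then selects the models with IDs 123, 0123 and 122110123. A call such as SuffixDuplicates.FindDuplicateSubIds(testData.TestRecords.Select(tr => tr.ID)) would stand in for dupRecordSubIDs.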

Tags: string, c#, duplicates, linq

davedno