I have two lists, both containing models that share a common field, ID (a string value). I am comparing the IDs to find duplicates.
I currently have a LINQ statement that determines the duplicated ID values and stores them in a list of strings:
List<string> duplicateRecords = testData.TestRecords.GroupBy(aa => aa.ID).Where(x => x.Count() > 1).Select(y => y.Key).ToList();
And a second LINQ statement that builds a list of the respective models from the duplicated IDs found by the first statement:
List<Model> modelRecords = testData.Models.Where(x => duplicateRecords.Any(y => x.ID == y)).ToList();
These two LINQ statements do exactly what I expect, which is great. But there is now a request to detect duplicate IDs by comparing only the last N characters of each pair of IDs, where N is the length of the shorter of the two strings being compared.
EX)
Hopefully those examples give some insight into the problem I am trying to solve. This could be done using foreach loops, but in my experience the code becomes very messy and unmanageable for complex list queries.
So my question is this: How can I use the last N characters of the shorter of the two strings being compared to determine duplicates using LINQ?
Note: I am also very open to more elegant ways of solving this problem, but I would really appreciate solutions that avoid for or foreach loops.
I assume that when the input contains 123 and 0123, you want the result to include both of them:
I assume that when the input contains 123 and 0123 you want the result to have both of them
var input = new List<Model>()
{
    new Model {ID = "123"},
    new Model {ID = "0123"},
    new Model {ID = "1230"},
    new Model {ID = "12"},
    new Model {ID = "122110123"}
};
var result = input.Where(x => input.Any(y => y != x && (y.ID.EndsWith(x.ID) || x.ID.EndsWith(y.ID)))).ToList();
// this will return 123, 0123 and 122110123
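One thing to watch in the query above: `EndsWith(string)` without a comparison argument is culture-sensitive, which is rarely what you want for IDs. Below is a minimal sketch of the same query with an ordinal comparison. The helper name `SharesSuffix` is made up for illustration, and anonymous objects stand in for `Model` so the snippet runs on its own as a top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true when the shorter string equals the last N
// characters of the longer one, where N is the shorter string's length.
static bool SharesSuffix(string a, string b) =>
    a.Length <= b.Length
        ? b.EndsWith(a, StringComparison.Ordinal)
        : a.EndsWith(b, StringComparison.Ordinal);

// Anonymous objects stand in for Model (an assumption for this sketch).
var input = new[]
{
    new { ID = "123" },
    new { ID = "0123" },
    new { ID = "1230" },
    new { ID = "12" },
    new { ID = "122110123" }
};

// Exclude the element itself by index instead of by reference.
var result = input
    .Where((x, i) => input.Where((y, j) => i != j).Any(y => SharesSuffix(x.ID, y.ID)))
    .Select(x => x.ID)
    .ToList();

Console.WriteLine(string.Join(", ", result)); // 123, 0123, 122110123
```

Excluding by index rather than with `y != x` also keeps the query correct if `Model` ever overrides equality.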
If you want to check against the existing duplicateRecords list, then this should work:
List<Model> modelRecords = testData.Models.Where(x => duplicateRecords.Any(y => x.ID.EndsWith(y) || y.EndsWith(x.ID))).ToList();
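Note that `EndsWith` returns true for an empty argument, so an empty ID on either side would match every record. If empty IDs can occur in your data (an assumption; the question does not say), the same filter can be sketched with a guard. Plain string lists stand in for `duplicateRecords` and the model IDs here, and the sample values are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for duplicateRecords and the IDs of testData.Models (sample data).
var duplicateRecords = new List<string> { "123", "" };
var modelIDs = new List<string> { "0123", "999", "23", "" };

// Drop empty keys once, up front, so they cannot match everything.
var keys = duplicateRecords.Where(k => !string.IsNullOrEmpty(k)).ToList();

var matches = modelIDs
    .Where(id => !string.IsNullOrEmpty(id))
    .Where(id => keys.Any(k =>
        id.EndsWith(k, StringComparison.Ordinal) ||
        k.EndsWith(id, StringComparison.Ordinal)))
    .ToList();

Console.WriteLine(string.Join(", ", matches)); // 0123, 23
```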
To find the duplicates efficiently, sort the IDs by length first so that you minimize the comparisons needed. (The sort adds some overhead but greatly reduces the comparisons that must be done: in my test with 9 IDs, where 3 are duplicates of 8 values, it took 15 comparisons sorted versus 42 unsorted.) Once they are sorted by length, compare each ID to the ones of equal or greater length (equal in case of complete duplicates) to find which short IDs need to be kept, marking any matches so you can skip them later. Then find all the Models whose ID ends with one of the found matches.
Create the List of IDs ordered by their length:
var orderedIDs = testData.TestRecords.Select(tr => tr.ID).OrderBy(id => id.Length).ToList();
I don't think there is any way to do this efficiently with LINQ, but a nested for loop skipping previous matches optimizes the search for duplicates.
First, variables to keep track of the duplicate IDs and which IDs have already matched:
var dupRecordSubIDs = new List<string>();
var alreadyMatched = new bool[testData.TestRecords.Count];
Now loop through the IDs and save the shorter matching IDs:
// for each ID in length order
for (int n1 = 0; n1 < testData.TestRecords.Count-1; ++n1) {
    // skip the ones that already matched a shorter ID
    if (!alreadyMatched[n1]) {
        // remember if the shorter ID was already added
        var added_n1 = false;
        // compare the ID to all greater than or equal length IDs
        for (int n2 = n1 + 1; n2 < testData.TestRecords.Count; ++n2) {
            // if not previously matched, see if we have a new match
            if (!alreadyMatched[n2] && orderedIDs[n2].EndsWith(orderedIDs[n1])) {
                // only add the shorter ID once for new matches
                if (!added_n1) {
                    dupRecordSubIDs.Add(orderedIDs[n1]);
                    added_n1 = true;
                }
                // remember which longer IDs are already matched
                alreadyMatched[n2] = true;
            }
        }
    }
}
Now find all the Models that match one of the IDs with a duplicate:
var modelRecords = testData.Models.Where(m => dupRecordSubIDs.Any(d => m.ID.EndsWith(d))).ToList();
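To see the steps above working together, here is a self-contained sketch that runs them on a small invented list of IDs (plain strings stand in for `testData.TestRecords` and `testData.Models`). Since the question asked to avoid loops, it also includes a LINQ-only formulation of the same rule: keep an ID if no shorter ID is a suffix of it and at least one later ID ends with it. That variant is my own addition, and it does not skip already-matched IDs, so expect it to perform more comparisons than the loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample IDs (hypothetical data) standing in for testData.TestRecords.
var ids = new List<string> { "122110123", "0123", "1230", "12", "123", "5512" };

// Step 1: order by length (OrderBy is a stable sort).
var orderedIDs = ids.OrderBy(id => id.Length).ToList();

// Step 2: nested loop that records each shortest matching ID once.
var dupRecordSubIDs = new List<string>();
var alreadyMatched = new bool[orderedIDs.Count];
for (int n1 = 0; n1 < orderedIDs.Count - 1; ++n1) {
    if (alreadyMatched[n1]) continue; // already covered by a shorter ID
    var added = false;
    for (int n2 = n1 + 1; n2 < orderedIDs.Count; ++n2) {
        if (!alreadyMatched[n2] &&
            orderedIDs[n2].EndsWith(orderedIDs[n1], StringComparison.Ordinal)) {
            if (!added) { dupRecordSubIDs.Add(orderedIDs[n1]); added = true; }
            alreadyMatched[n2] = true;
        }
    }
}

// LINQ-only formulation of the same rule (illustrative, not faster).
var linqSubIDs = orderedIDs
    .Where((id, i) =>
        !orderedIDs.Take(i).Any(s => id.EndsWith(s, StringComparison.Ordinal)) &&
        orderedIDs.Skip(i + 1).Any(l => l.EndsWith(id, StringComparison.Ordinal)))
    .ToList();

// Step 3: every ID that ends with one of the recorded short IDs.
var matchedIDs = ids
    .Where(id => dupRecordSubIDs.Any(d => id.EndsWith(d, StringComparison.Ordinal)))
    .ToList();

Console.WriteLine(string.Join(", ", dupRecordSubIDs)); // 12, 123
Console.WriteLine(string.Join(", ", matchedIDs));      // 122110123, 0123, 12, 123, 5512
```

In this sample, `1230` is the only ID left out, because it neither ends with a shorter ID nor is a suffix of a longer one.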