I have an array of custom objects named AnalysisResult
. The array can contain hundreds of thousands of objects; and, occasionally I need only the Distinct()
elements of that array. So, I wrote a item comparer class called AnalysisResultDistinctItemComparer
and do my call like this:
public static AnalysisResult[] GetDistinct(AnalysisResult[] results)
{
return results.Distinct(new AnalysisResultDistinctItemComparer()).ToArray();
}
My problem here is that this call can take a LONG time (on the order of minutes) when the array is particularly big (greater than 200,000 objects).
I currently call that method in a background worker and display a spinning gif to alert the user that the method is being performed and that the application has not frozen. This is all fine and well but it does not give the user any indication of the current progress.
I really need to be able to indicate to the user the current progress of this action; but, I have been unable to come up with a good approach. I was playing with doing something like this:
public static AnalysisResult[] GetDistinct(AnalysisResult[] results)
{
var query = results.Distinct(new AnalysisResultDistinctItemComparer());
List<AnalysisResult> retVal = new List<AnalysisResult>();
foreach(AnalysisResult ar in query)
{
// Show progress here
retVal.Add(ar);
}
return retVal.ToArray();
}
But the problem is that I have no way of knowing what my actual progress is. Thoughts? Suggestions?
Don't call ToArray()
at the end of your method, just use yield return
. So do this:
public static IEnumerable<AnalysisResult> Distinct(AnalysisResult[] results)
{
var query = results.Distinct(new AnalysisResultDistinctItemComparer());
foreach(AnalysisResult ar in query)
{
// Use yield return here, so that the iteration remains lazy.
yield return ar;
}
}
Basically, yield return
does some compiler magic to ensure that the iteration remains lazy, so you don't have to wait for a complete new collection to be created before returning to the caller. Instead, as each item is calculated, you return that item immediately to the consumer (which can then perform the updating logic -- per-item if necessary). You can use the same technique in your GetDistinct
method as well.
Jon Skeet has an implementation that looks like this (LINQ's Distinct() on a particular property):
public static IEnumerable<TSource> DistinctBy<TSource, TKey>
(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
HashSet<TKey> seenKeys = new HashSet<TKey>();
foreach (TSource element in source)
{
if (seenKeys.Add(keySelector(element)))
{
yield return element;
}
}
}
Notice here that he uses a HashSet
, which is built to disallow duplicates. Simply check to see if the item has already been added, and if not, then return it.
That all said, remember that this is an Algorithms-and-Data Structures type question. It would be much easier to do something like this:
Dictionary<Key, Value> distinctItems = new Dictionary<Key, Value>();
foreach (var item in nonDistinctSetOfItems) {
if (distinctItems.ConatainsKey(item.KeyProperty) == false) {
distinctItems.Add(item.KeyProperty, item);
}
}
... = distinctItems.Values // This would contain only the distinct items.
That is, a Symbol Table/Dictionary
is built for just this sort of problem - associating entries with unique keys. If you keep your data stored this way, it greatly simplifies the problem. Don't overlook the simple solution!
Given the design of that Distinct method, you're iterating over the entire collection every time you call Distinct. Have you considered writing a custom collection that adds to an index each time you add an object to the array?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With