Is there a fast built-in way to check if an IEnumerable<string> contains only distinct strings?
I started with:
// "strings" is the IEnumerable<string> being checked ("enum" is a reserved word in C#)
var enumAsArray = strings.ToArray();
if (enumAsArray.Length != enumAsArray.Distinct().Count())
    throw ...
However, this looks like it is O(2n) - is it? ToArray() might be O(1)?
This looks faster:
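// "strings" is the IEnumerable<string> being checked ("enum" is a reserved word in C#)
var set = new HashSet<string>();
foreach (var str in strings)
{
    // Add returns false when the string is already in the set
    if (!set.Add(str))
        throw ...
}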
This should be O(n). However, is there a built-in way too?
Edit: Maybe Distinct() uses this internally?
Solution: After considering all the comments and the answer, I wrote an extension method for my second solution, as this seems to be the fastest version and the most readable too:
public static bool ContainsDuplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            return true;
    }
    return false;
}
Your second code sample is short, simple, clearly effective, and if not the completely perfect ideal solution, is clearly rather close to it. It seems like a perfectly acceptable solution to your particular problems.
Unless your use of that particular solution is shown to cause performance problems after you've noticed issues and done performance testing, I'd leave it as is. Given how little room I can see for improvement in general, that doesn't seem likely. It's not a sufficiently lengthy or complex solution that trying to find something "shorter" or more concise is going to be worth your time and effort.
In short, there are almost certainly better places in your code to spend your time; what you have already is fine.
To answer your specific questions:
However, this looks like it is O(2n) - is it?
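Yes, it makes two passes over the data: one in ToArray() and one in Distinct().Count(). Strictly speaking, though, O(2n) and O(n) are the same complexity class, since big-O notation drops constant factors.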
ToArray() might be O(1)?
No, it's not. It has to enumerate the sequence and copy every element, so it is O(n).
Maybe Distinct() uses this internally?
It does use a hash-based set internally, and it looks pretty similar, but it simply skips duplicate items; it gives the caller no indication that it has just passed over a duplicate. As a result, you need to iterate the whole sequence twice to see whether it removed anything, rather than stopping when the first duplicate is encountered. This is the difference between something that always iterates the full sequence twice and something that iterates the full sequence at most once, and can short-circuit and stop as soon as it knows the answer.
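To make the comparison concrete, here is a minimal sketch of what a Distinct-style iterator does. It is not the framework's actual source; `DistinctSketch` is a hypothetical name, and the sketch assumes the default equality comparer.

```csharp
using System.Collections.Generic;

public static class DistinctSketchExtensions
{
    // Hypothetical, simplified stand-in for Enumerable.Distinct (default comparer only).
    public static IEnumerable<T> DistinctSketch<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
        {
            // Add returns false for an item that is already in the set.
            // The duplicate is skipped silently: nothing tells the caller it happened.
            if (seen.Add(item))
                yield return item;
        }
    }
}
```

The only difference from the loop in the question is what happens when `Add` returns false: here the item is dropped and enumeration continues, whereas the question's loop stops right there.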
is there a built-in way too?
Well, you showed one; it's just not as efficient. I can think of no purely LINQ-based solution as efficient as what you showed. The best I can think of would be: data.GroupBy(x => x).Any(g => g.Skip(1).Any()). This is a bit better than comparing the Distinct() count to the regular count, in that the pass over the groups can short-circuit. Building the groups cannot, though: GroupBy has to consume and buffer the whole sequence before the first group is available. It is still worse than your non-LINQ solution, so it's still not worth using.
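As a rough sketch of what a LINQ-only check can look like, the following groups equal items and asks whether any group has more than one member. `HasDuplicatesLinq` is a hypothetical name used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqDuplicateCheck
{
    // Hypothetical helper: true if any value occurs more than once.
    public static bool HasDuplicatesLinq<T>(this IEnumerable<T> data)
    {
        // GroupBy consumes the entire source before yielding its first group,
        // so there is no early exit on the source itself.
        // Any() can stop at the first group that has a second element.
        return data.GroupBy(x => x).Any(g => g.Skip(1).Any());
    }
}
```

Unlike the HashSet loop, this always reads every element and holds all of them in memory, which is why the hand-written loop remains the better choice.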
Here is a possible refinement to the OP's answer:
public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            yield return item;
    }
}
You now have a potentially useful method to get the actual duplicate items and you can answer your original question with:
collection.Duplicates().Any()
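One detail worth knowing: the method yields an item every time it is seen again, so a value that appears three times is returned twice. A small sketch follows, repeating the method so the block is self-contained; `DistinctDuplicates` is a hypothetical helper, not part of the answer.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateExtensions
{
    // Same logic as the method above: yields an item each time it shows up again.
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> e)
    {
        var set = new HashSet<T>();
        foreach (var item in e)
        {
            if (!set.Add(item))
                yield return item;
        }
    }

    // Hypothetical helper: each duplicated value reported only once.
    public static IEnumerable<T> DistinctDuplicates<T>(this IEnumerable<T> e)
    {
        return e.Duplicates().Distinct();
    }
}
```

For the input "a", "b", "a", "a", "c", "b", Duplicates() yields "a", "a", "b", while DistinctDuplicates() yields "a", "b". Because the iterator is lazy, Duplicates().Any() stops enumerating at the first repeated item, so it keeps the short-circuit behaviour of the original loop.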