I am interested in the asymptotic complexity (big O) of the GroupBy operation on unindexed datasets. What's the complexity of the best known algorithm and what's the complexity for algorithms that SQL servers and LINQ are using?
Setting aside the cost of the underlying query that feeds it, the GROUP BY operation itself is O(n): the rows are scanned and aggregated in a single pass, so the cost scales linearly with n, the size of the dataset.
When GROUP BY is part of a more complex query, the picture changes: O(n) is the upper bound on what the GROUP BY step adds to the total cost. It can add less if resolving the inner query already leaves the data sorted.
Grouping can be done in one pass (O(n)) over sorted rows, and sorting costs O(n log n), so the complexity of GROUP BY is O(n log n), where n is the number of rows. If the columns used in the GROUP BY clause are indexed, the sort is unnecessary and the complexity is O(n).
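To make the sort-then-scan idea concrete, here is a minimal Python sketch. It is an illustration only, not how any particular SQL engine implements GROUP BY; the function name group_by_sorted and the sample rows are made up for the example.

```python
from itertools import groupby


def group_by_sorted(rows, key):
    """Group rows by sorting them first, then scanning once.

    Hypothetical helper for illustration: the sort is the O(n log n)
    step, and the scan over the sorted rows is the O(n) step.
    """
    ordered = sorted(rows, key=key)  # O(n log n); Python's sort is stable
    # itertools.groupby only merges *adjacent* equal keys, which is why
    # the input has to be sorted (or otherwise pre-ordered) by the key.
    return [(k, list(group)) for k, group in groupby(ordered, key=key)]


rows = [("a", 1), ("b", 2), ("a", 3)]
result = group_by_sorted(rows, key=lambda row: row[0])
```

If the rows already arrive ordered by the grouping key (the role an index plays in the answer above), the sorted call can be skipped and only the linear scan remains.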
As for LINQ, I assume you want to know the complexity of the LINQ to Objects group by (Enumerable.GroupBy).
Checking the implementation with ILSpy, it appears to me to be O(n). (.NET Framework 4 series.)
It enumerates the source collection once. For each element, it computes the grouping key, then checks whether that key is already in a hash table that maps keys to lists of elements, adding the key if it is missing. Finally, it adds the element to the list stored under that key.
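The steps described above can be sketched in Python roughly as follows. This is not the .NET source, just an illustration of the same hash-based approach; group_by_hash and key_selector are names chosen for the example. The O(n) figure assumes that hash lookups and inserts take constant time on average.

```python
def group_by_hash(source, key_selector):
    """Group elements in a single pass using a hash table (a dict).

    Hypothetical helper for illustration: each element costs one key
    computation plus one expected O(1) dict lookup, so the whole pass
    is O(n) on average.
    """
    groups = {}  # maps grouping key -> list of elements with that key
    for element in source:
        key = key_selector(element)
        if key not in groups:
            groups[key] = []  # first time this key is seen
        groups[key].append(element)
    return groups


fruits = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_initial = group_by_hash(fruits, key_selector=lambda s: s[0])
```

No sorting is involved, which is why this approach avoids the O(n log n) term. In the worst case, with many hash collisions, individual lookups degrade and the linear bound no longer holds.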