I've written some very basic tools for grouping, pivoting, unioning and subtotaling datasets sourced from non DB sources (eg: CSV, OLTP systems). The "group by" methods sit at the core of most of these.
However i'm sure lot of work has been done in making efficient algorithms for grouping data... and i'm sure i'm not using them. And my Google-fu has completely failed to turn anything up.
Are there any good online sources or books describing the better methods to create grouped data?
Or should i just start looking at the MySQL source or something similar?
Consequently, group by is more expensive than order by in current MySQL releases. However, many other databases also consider using a hash table for the group by — in this case group by could also be faster than order by (as there is no hash-based sort algorithm).
Both GROUP BY and ORDER BY are clauses (or statements) that serve similar functions; that is to sort query results. However, each of these serve very different purposes; so different in fact, that they can be employed separately or together.
Yes, it is possible to use MySQL GROUP BY clause with multiple columns just as we can use MySQL DISTINCT clause. Consider the following example in which we have used DISTINCT clause in first query and GROUP BY clause in the second query, on 'fname' and 'Lname' columns of the table named 'testing'.
GROUPING is used to distinguish the null values that are returned by ROLLUP, CUBE or GROUPING SETS from standard null values. The NULL returned as the result of a ROLLUP, CUBE or GROUPING SETS operation is a special use of NULL. This acts as a column placeholder in the result set and means all.
One very handy way to "group by" some field (or set of fields and expressions, but I'll use "field" for simplicity!-) is when you can arrange to walk over the results before grouping (RBG) in a sorted way -- you actually don't care about the sorting (save in the common case in which an ORDER BY is also there and just happens to be on the same field as the GROUP BY!-), but rather about the "side effect" property of ordering -- that all rows in RBG with the same value for the grouping field come right after each other, so you can accumulate until the grouping field changes, then emit/yield the results accumulated so far, and proceed to reinitialize the accumulators with the new row (the one with a different value of the grouping field) -- make sure to "just initialize the accumulators" at the very start, AND "just emit/yield accumulated results" at the very end, of course.
If this doesn't work, maybe you can hash the grouping field and use a hash table for the results being accumulated for that group -- at each row in RBG, hash the grouping field, check if it was already present as a key in the hash table, if not put it there with accumulators suitably initialized from the RBG row, else update the accumulators per the RBG row. You just emit everything at the end. The problem of course is you're taking up more memory until the end!-)
These are the two fundamental approaches. Would you like pseudocode for each, BTW?
You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.
Aggregate measures such as counts, averages, mins, maxs, sums and stdev's can be quickly analyzed by any number of dimensions using an OLAP database.
See this introduction to OLAP on MSDN.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With