"Group By" and other database algorithms?

Tags: algorithm, database

I've written some very basic tools for grouping, pivoting, unioning and subtotaling datasets that come from non-DB sources (e.g. CSV files, OLTP systems). The "group by" methods sit at the core of most of these.

However, I'm sure a lot of work has been done on efficient algorithms for grouping data... and I'm sure I'm not using them. My Google-fu has completely failed to turn anything up.

Are there any good online sources or books describing the better methods to create grouped data?

Or should I just start looking at the MySQL source or something similar?

asked Jul 15 '09 03:07

Mark Nold

2 Answers

One very handy way to "group by" some field (or set of fields and expressions, but I'll say "field" for simplicity) is to arrange to walk over the results before grouping (RBG) in sorted order. You don't actually care about the sorting itself (except in the common case where an ORDER BY is also present and happens to be on the same field as the GROUP BY). What you care about is its side effect: all rows in RBG with the same value for the grouping field come right after each other. So you accumulate until the grouping field changes, then emit/yield the results accumulated so far and reinitialize the accumulators from the new row (the one with a different value of the grouping field). Two edge cases: at the very start you just initialize the accumulators without emitting anything, and at the very end you just emit the accumulated results.
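As a rough illustration of the sorted approach, here is a minimal Python sketch. The function name group_sorted and its key/init/update parameters are made up for this example (they are not from any library), and it assumes the rows already arrive sorted, or at least clustered, by the grouping field.

```python
def group_sorted(rows, key, init, update):
    """Yield (group_key, accumulator) pairs from rows sorted by key.

    key(row)         -> value of the grouping field
    init(row)        -> fresh accumulator built from a group's first row
    update(acc, row) -> accumulator after folding in one more row
    (All three are hypothetical callbacks chosen for this sketch.)
    """
    it = iter(rows)
    try:
        first = next(it)
    except StopIteration:
        return  # no rows, so no groups to emit

    # Very start: just initialize the accumulators.
    current = key(first)
    acc = init(first)

    for row in it:
        k = key(row)
        if k != current:
            # Grouping field changed: emit the finished group,
            # then reinitialize from the new row.
            yield current, acc
            current, acc = k, init(row)
        else:
            acc = update(acc, row)

    # Very end: just emit what has been accumulated.
    yield current, acc


# Example: sum the second column per value of the first column.
rows = [("a", 1), ("a", 2), ("b", 5)]
totals = list(group_sorted(
    rows,
    key=lambda r: r[0],
    init=lambda r: r[1],
    update=lambda acc, r: acc + r[1],
))
```

Only one group's accumulator is alive at any time, so memory use does not grow with the number of groups. Python's standard itertools.groupby relies on the same adjacency property, so it also needs sorted (or clustered) input.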

If you can't arrange for sorted input, you can hash the grouping field and keep a hash table of the accumulators for each group. At each row in RBG, hash the grouping field and check whether it is already present as a key in the hash table. If not, put it there with accumulators suitably initialized from the RBG row; otherwise, update that group's accumulators from the RBG row. You emit everything at the end. The problem, of course, is that you hold the accumulators for every group in memory until the end, whereas the sorted approach only ever holds one group's.
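The hash-based approach could be sketched in Python along these lines, using a plain dict as the hash table. Again, group_hashed and its key/init/update parameters are hypothetical names for this example only; the input can be in any order.

```python
def group_hashed(rows, key, init, update):
    """Return {group_key: accumulator} for rows in arbitrary order.

    key(row)         -> value of the grouping field (must be hashable)
    init(row)        -> fresh accumulator built from a group's first row
    update(acc, row) -> accumulator after folding in one more row
    (All three are hypothetical callbacks chosen for this sketch.)
    """
    groups = {}  # the hash table: one entry per distinct group
    for row in rows:
        k = key(row)
        if k in groups:
            # Group seen before: update its accumulator.
            groups[k] = update(groups[k], row)
        else:
            # First row of a new group: initialize its accumulator.
            groups[k] = init(row)
    # Nothing can be emitted earlier, because any later row
    # might still belong to any group.
    return groups


# Example: sum the second column per value of the first column,
# with the rows deliberately unsorted.
rows = [("b", 5), ("a", 1), ("a", 2), ("b", 1)]
totals = group_hashed(
    rows,
    key=lambda r: r[0],
    init=lambda r: r[1],
    update=lambda acc, r: acc + r[1],
)
```

This needs a single pass and no sort, at the cost of memory proportional to the number of distinct groups.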

These are the two fundamental approaches. Would you like pseudocode for each, BTW?

answered Nov 11 '22 10:11

Alex Martelli

You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.

With an OLAP database, aggregate measures such as counts, averages, minimums, maximums, sums and standard deviations can be analyzed quickly across any number of dimensions.

See this introduction to OLAP on MSDN.

answered Nov 11 '22 11:11

jn29098
