Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most-frequent values for each column independently, and with the most elegant SQL possible. For example, if I have
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris Freemont Possum Ubik
Nancy Freemont Lemur Housekeeping
Nancy Drew Penguin Ubik
Bill Ribbits Lemur Dhalgren
and I want the top-1, then the result would be:
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy Freemont Lemur Ubik
I can probably think of ways to do it, but not sure if they're optimal, which is important when there are 30 billion rows; and the SQL might be big and ugly, and maybe too much temp space would be used.
Using Oracle.
This should only do one pass over the table. You can use the analytic version of count()
to get the frequency of each value independently:
select firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;
FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill 1 Ribbits 1 Lemur 2 Dhalgren 1
Ferris 1 Freemont 2 Possum 1 Ubik 2
Nancy 2 Freemont 2 Lemur 2 Housekeeping 1
Nancy 2 Drew 1 Penguin 1 Ubik 2
You can then use that as a CTE (or subquery factoring, I think in oracle terminology) and pull only the highest-frequency value from each column:
with tmp_tab as (
select /*+ MATERIALIZE */
firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table)
select (select firstname from (
select firstname,
row_number() over (partition by null order by c_fn desc) as r_fn
from tmp_tab
) where r_fn = 1) as firstname,
(select lastname from (
select lastname,
row_number() over (partition by null order by c_ln desc) as r_ln
from tmp_tab
) where r_ln = 1) as lastname,
(select favoriteanimal from (
select favoriteanimal,
row_number() over (partition by null order by c_fa desc) as r_fa
from tmp_tab
) where r_fa = 1) as favoriteanimal,
(select favoritebook from (
select favoritebook,
row_number() over (partition by null order by c_fb desc) as r_fb
from tmp_tab
) where r_fb = 1) as favoritebook
from dual;
FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy Freemont Lemur Ubik
You're doing one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize
hint). And you may want to add to the order by
clauses to tweak what do to if there are ties.
This is similar in concept to what Thilo, ysth and others have suggested, except you're letting Oracle keep track of all the counting.
Edit: Hmm, explain plan shows it doing four full table scans; may need to think about this a bit more...
Edit 2: Adding the (undocumented) MATERIALIZE
hint to the CTE seems to resolve this; it's creating a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher though - at least on this time sample data set. Be interested in any comments on any downside to doing this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With