Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find top-N values from multiple columns independently in Oracle

Tags:

sql

oracle

top-n

Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most-frequent values for each column independently, and with the most elegant SQL possible. For example, if I have

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris    Freemont Possum         Ubik
Nancy     Freemont Lemur          Housekeeping
Nancy     Drew     Penguin        Ubik
Bill      Ribbits  Lemur          Dhalgren

and I want the top-1, then the result would be:

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy     Freemont Lemur          Ubik

I can probably think of ways to do it, but not sure if they're optimal, which is important when there are 30 billion rows; and the SQL might be big and ugly, and maybe too much temp space would be used.

Using Oracle.

like image 796
Joda Avatar asked Jan 19 '23 20:01

Joda


1 Answers

This should only do one pass over the table. You can use the analytic version of count() to get the frequency of each value independently:

select firstname, count(*) over (partition by firstname) as c_fn,
    lastname, count(*) over (partition by lastname) as c_ln,
    favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
    favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;

FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
Ferris    1 Freemont    2 Possum     1 Ubik            2
Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
Nancy     2 Drew        1 Penguin    1 Ubik            2

You can then use that as a CTE (or subquery factoring, I think in oracle terminology) and pull only the highest-frequency value from each column:

with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln,
        favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
        favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table)
select (select firstname from (
        select firstname,
            row_number() over (partition by null order by c_fn desc) as r_fn
            from tmp_tab
        ) where r_fn = 1) as firstname,
    (select lastname from (
        select lastname,
            row_number() over (partition by null order by c_ln desc) as r_ln
        from tmp_tab
        ) where r_ln = 1) as lastname,
    (select favoriteanimal from (
        select favoriteanimal,
            row_number() over (partition by null order by c_fa desc) as r_fa
        from tmp_tab
        ) where r_fa = 1) as favoriteanimal,
    (select favoritebook from (
        select favoritebook,
            row_number() over (partition by null order by c_fb desc) as r_fb
        from tmp_tab
        ) where r_fb = 1) as favoritebook
from dual;

FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy  Freemont Lemur   Ubik

You're doing one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). And you may want to add to the order by clauses to tweak what do to if there are ties.

This is similar in concept to what Thilo, ysth and others have suggested, except you're letting Oracle keep track of all the counting.

Edit: Hmm, explain plan shows it doing four full table scans; may need to think about this a bit more... Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it's creating a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher though - at least on this time sample data set. Be interested in any comments on any downside to doing this.

like image 191
Alex Poole Avatar answered Jan 30 '23 23:01

Alex Poole