Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Oracle SQL - Select First n Distinct Rows

Tags:

sql

oracle

In Oracle 12, we finally have a limit feature, so we can simply

select distinct columnname from mytable fetch first n rows only;

However, I am presently stuck with a previous version and would like to know how to achieve this result.

Ideally the query should return rows immediately, even for a huge table. It should return results as soon as the N distinct have been found, instead of processing all rows.

like image 309
Apollys supports Monica Avatar asked Feb 05 '23 15:02

Apollys supports Monica


2 Answers

I think using

.. where rownum < XXX

should work

Perhaps

select * from (select distinct columnname from mytable) where  rownum < XXX
like image 74
Ed Heal Avatar answered Feb 07 '23 04:02

Ed Heal


No version of Oracle has a native syntax that returns a distinct Top N in an optimal way. This feature must be manually created using a PL/SQL pipelined function.

Sample Schema

This script creates a table with one column, about 100 million rows, that uses roughly 1GB of space.

--drop table mytable purge;

create table mytable(columnname number not null) nologging;

insert /*+ append */ into mytable
select level from dual connect by level <= 100000;
commit;

begin
    for i in 1 .. 10 loop
        insert /*+ append */ into mytable select * from mytable;
        commit;
    end loop;
end;
/

begin
    dbms_stats.gather_table_stats(user, 'MYTABLE');
end;
/

--1.25GB.
select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MYTABLE';

Oracle 12c row limiting clause does not work well with distinct.

The new 12c syntax consistently runs in about 20 seconds to return a small amount of rows:

select distinct columnname from mytable fetch first 10 rows only;

The statement reads the entire table, hashes the entire table, and then grabs the top N rows:

explain plan for
select distinct columnname from mytable fetch first 10 rows only;

select * from table(dbms_xplan.display(format => 'basic'));

Plan hash value: 239985407

------------------------------------------
| Id  | Operation              | Name    |
------------------------------------------
|   0 | SELECT STATEMENT       |         |
|   1 |  VIEW                  |         |
|   2 |   WINDOW NOSORT STOPKEY|         |
|   3 |    VIEW                |         |
|   4 |     HASH UNIQUE        |         |
|   5 |      TABLE ACCESS FULL | MYTABLE |
------------------------------------------

The Oracle 11g version created by Ed Heal below surprisingly works better! It runs in about 12 seconds.

select * from (select distinct columnname from mytable) where  rownum < 10;

Even though it's much faster, 12 seconds is still horrible. Regardless of my CPU or I/O performance, the algorithms must be wrong if it's taking seconds instead of milliseconds.

Indeed, this plan looks a little better. It has a SORT GROUP BY STOPKEY low in the plan. This stops the query before it processes everything. But it's still stopping much too late. (Maybe Oracle is still reading the entire table but only sorting part of it?)

explain plan for
select * from (select distinct columnname from mytable) where  rownum < 10;

select * from table(dbms_xplan.display(format => 'basic'));

Plan hash value: 3842480186

-------------------------------------------
| Id  | Operation               | Name    |
-------------------------------------------
|   0 | SELECT STATEMENT        |         |
|   1 |  COUNT STOPKEY          |         |
|   2 |   VIEW                  |         |
|   3 |    SORT GROUP BY STOPKEY|         |
|   4 |     TABLE ACCESS FULL   | MYTABLE |
-------------------------------------------

Pipelined Functions

This is an ugly solution for several reasons. It requires new code and objects for different result sets. It may not scale well - the function has a collection to store previous results, what happens if that collection gets huge?

Every different result type requires new objects:

--Create an object to hold a record with the result columns.
--(Not necessary for this simple example since there's only one column, but will
-- be necessary if there are multiple columns.)
create or replace type columnname_rec is object
(
    columnname number
);

--Create an object to hold a table of the records.
create or replace type columnname_tab is table of columnname_rec;

And a different function that returns a different type:

--Function that returns the distinct Top N as soon as they are found.
create or replace function fast_distinct_top_n(p_n number, p_cursor in sys_refcursor) return columnname_tab pipelined is
    v_columnname number;
    v_distinct_count number := 0;

    type previous_values_type is table of varchar2(4000) index by varchar2(4000);
    v_previous_values previous_values_type;
begin
    loop
        --Get new value.
        fetch p_cursor into v_columnname;

        --If the new value does not exist...
        if not v_previous_values.exists(v_columnname) then
            --Save the new value.
            v_previous_values(v_columnname) := v_columnname;
            --Increment the counter.
            v_distinct_count := v_distinct_count + 1;
            --Return the value
            pipe row(columnname_rec(v_columnname));
            --Exit if the counter is more than the top N.
            exit when v_distinct_count >= p_n;
        end if;
    end loop;
end;
/

But finally we have created a query that returns distinct Top N results in milliseconds.

select * from table(fast_distinct_top_n(10, cursor(select * from mytable)));

If you can't stand creating so many objects there's probably a way to make it generic using Method4. But that solution would still be complicated.

like image 25
Jon Heller Avatar answered Feb 07 '23 05:02

Jon Heller