
How can I speed up queries against huge data warehouse tables with effective-dated data?

So I am querying some extremely large tables. The reason they are so large is because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.

This necessitates queries that have nested selects in them, to get the most recent/current row. They are both effective dated and within each date (cast to a day) they can have an effective sequence. Thus, in order to get the current record for user_id=123, I have to do this:

select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date) 
  from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.

My trouble is that I often want to get data about individuals from these tables for maybe 50 user_ids, but when I join my own tables, which have only a few records in them, with a few of these PeopleSoft tables, things just go to crap.

The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:

select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date) 
  from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence) 
  from sometable@remotedb sti where sti.user_id = st.user_id
  and sti.effective_date = st.effective_date)

Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.

What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its partner in PeopleSoft first, so it doesn't attempt to join all its tables together before narrowing it down to the correct user_id. I've tried the LEADING hint and toyed around with hints that tried to push the processing to the remote database, but the explain plan was obscured and just said 'REMOTE' for several of the operations and I had no idea what was going on.

Assuming I don't have the power to change PeopleSoft and the location of my tables, are hints my best choice? If I was joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table (which is very small -- in fact, I can just do an inline view to have my local table only be the user_ids I'm interested in) is joined first with each of the remote ones?

EDIT: The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.

aw crud asked Jun 24 '10 19:06


2 Answers

Does refactoring your query something like this help at all?

SELECT *
  FROM (SELECT st.*, MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
                     MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
          FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
         WHERE lt.user_id in ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_sequence = max_seq;

I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.
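
A variation that may be worth testing, since the question mentions that the local table can be reduced to a plain list of user_ids: leave the local table out entirely, so the statement references only objects on remotedb. A statement like that can generally be shipped to the remote site as a whole instead of pulling rows across the link to join them locally. This is only a sketch using the table and column names from the question; the rn alias is made up for the example, and whether the plan actually improves needs to be checked on the real system.

```sql
-- Sketch only: names follow the question; rn is a hypothetical alias.
-- ROW_NUMBER picks the latest row per user in a single pass, replacing
-- the two correlated MAX() subqueries.
SELECT *
  FROM (SELECT st.*,
               ROW_NUMBER() OVER (PARTITION BY st.user_id
                                  ORDER BY st.effective_date DESC,
                                           st.effective_sequence DESC) rn
          FROM sometable@remotedb st
         WHERE st.user_id IN ('123', '456', '789'))
 WHERE rn = 1;
```

The result carries an extra rn column, so list the columns explicitly in the outer SELECT if that matters to the caller.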

DCookie answered Oct 14 '22 06:10


One approach would be to stick PL/SQL functions around everything. As an example:

create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));

create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/

create type typ_tab_remote as table of typ_remote;
.
/

insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');

insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');

create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
    CURSOR c_1 is
    SELECT user_id, eff_date, eff_seq, value
    FROM
        (select user_id, eff_date, eff_seq, value, 
                        rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
        from remote
        where user_id in (i_user_id_1,i_user_id_2))
    WHERE rnk = 1;
begin
    for c_rec in c_1 loop
        pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
    end loop;
    return;
end;
/

select * from table(show_remote(1,null));

select * from table(show_remote(1,2));

Rather than passing the user_ids individually as parameters, you could load them into a local table (e.g. a global temporary table). The PL/SQL would then loop through that table, doing the remote select for each row in the local table. No single query would touch both local and remote tables. Effectively, you would be writing your own join code.
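
A minimal sketch of that variant, reusing the remote table and the typ_remote / typ_tab_remote types from the example above. The names gtt_user_ids and show_remote_gtt are made up for illustration, and in a real setup the inner query would read from remote@remotedb rather than from a local stand-in.

```sql
-- Hypothetical staging table for the ids of interest (one copy per session).
create global temporary table gtt_user_ids (user_id number)
  on commit preserve rows;

create function show_remote_gtt return typ_tab_remote pipelined is
begin
    -- Outer loop reads only the local temporary table.
    for id_rec in (select user_id from gtt_user_ids) loop
        -- Inner query reads only the "remote" table, one user_id at a time.
        for c_rec in (select user_id, eff_date, eff_seq, value
                      from (select user_id, eff_date, eff_seq, value,
                                   rank() over (order by eff_date desc, eff_seq desc) rnk
                            from remote
                            where user_id = id_rec.user_id)
                      where rnk = 1) loop
            pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
        end loop;
    end loop;
    return;
end;
/

insert into gtt_user_ids values (1);
insert into gtt_user_ids values (2);

select * from table(show_remote_gtt());
```

With the sample rows inserted earlier, this should return the 'f' row for user 1 and the 'd' row for user 2. The cost is one remote query per user_id, which is probably acceptable for around 50 ids but would not scale to thousands.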

Gary Myers answered Oct 14 '22 05:10