Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split up a massive data query into multiple queries

Tags:

sql

select

oracle

I have to select all rows from a table with millions of rows (to preload a Coherence datagrid.) How do I split up this query into multiple queries that can be concurrently executed by multiple threads?

I first thought of getting a count of all records and doing:

SELECT ... 
WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize)

but that didn't work. Now I'm stuck.

Any help will be very appreciated.

like image 933
Chris Avatar asked Feb 24 '23 15:02

Chris


1 Answers

If you have the Enterprise Edition license, the easiest way of achieving this objective is parallel query.

For one-off or ad hoc queries use the PARALLEL hint:

select /*+ parallel(your_table, 4) */  *
from your_table
/

The number in the hint is the number of slave queries you want to execute; in this case the database will run four threads.

If you want every query issued on the table to be parallelizable then permanently alter the table definition:

alter table your_table parallel (degree 4)
/

Note that the database won't always use parallel query; the optimizer will decide whether it's appropriate. Parallel query only works with full table scans or index range scans which cross multiple partitions.

There are a number of caveats. Parallel query requires us to have sufficient cores to satisfy the proposed number of threads; if we only have a single dual-core CPU setting a parallel degree of 16 isn't going to magically speed up the query. Also, we need spare CPU capacity; if the server is already CPU bound then parallel execution is only going to make things worse. Finally, the I/O and storage subsystems need to be capable of satisfying the concurrent demand; SANs can be remarkably unhelpful here.

As always in matters of performance, it is crucial to undertake some benchmarking against realistic volumes of data in a representative environment before going into production.


What if you don't have Enterprise Edition? Well, it is possible to mimic parallel execution by hand. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself, and it works well.

The key thing is to work out the total range ROWIDs which apply to the table, and split them across multiple jobs. Unlike some of the other solutions proposed in this thread, each job only selects the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital split script: find it here.

Splitting the table and starting off threads is a manual task: fine as a one-off but rather tiresome to undertake frequently. So if you are running 11g release 2 you ought to know that there is a new PL/SQL package DBMS_PARALLEL_EXECUTE which automates this for us.

like image 98
APC Avatar answered Feb 26 '23 03:02

APC