
Fetching rows at extremely high speed

I have a very large table (hundreds of millions of rows, containing numbers and strings) in Oracle, and I need to read all of its content, format it, and write it to a file or any other sink. Generally, my solution looks like this:

package my.odp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.sql.*;


public class Main {
    public static volatile boolean finished = false;

    public static void main(final String[] args) throws InterruptedException {
        final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        final Thread writeWorker = new Thread("ODP Writer") {
            public void run() {
                try {
                    File targetFile = new File(args[0]);
                    FileWriter fileWriter = new FileWriter(targetFile);
                    BufferedWriter writer = new BufferedWriter(fileWriter);
                    try {
                        // Keep draining until the reader is done AND the queue is
                        // empty, so lines still queued at shutdown are not dropped.
                        while (!finished || !queue.isEmpty()) {
                            String str = queue.poll(200, TimeUnit.MILLISECONDS);
                            if (str == null) {
                                continue;
                            }
                            writer.write(str);
                            writer.write('\n');
                        }
                    } finally {
                        writer.close(); // also flushes the remaining buffer
                    }
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }
        };

        final Thread readerThread = new Thread("ODP Reader") {
            public void run() {
                try {
                    Class.forName("oracle.jdbc.OracleDriver");
                    Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");

                    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                    stmt.setFetchSize(500000);
                    ResultSet rs = stmt.executeQuery("select * from src_schema.big_table_view");
                    System.out.println("Fetching result");
                    while (rs.next()) {
                        StringBuilder sb = new StringBuilder();
                        sb.append(rs.getString(1)).append('\t');  // OWNER
                        sb.append(rs.getString(2)).append('\t');  // OBJECT_NAME
                        sb.append(rs.getString(3)).append('\t');  // SUBOBJECT_NAME
                        sb.append(rs.getLong(4)).append('\t');    // OBJECT_ID
                        sb.append(rs.getLong(5)).append('\t');    // DATA_OBJECT_ID
                        sb.append(rs.getString(6)).append('\t');  // OBJECT_TYPE
                        sb.append(rs.getString(7)).append('\t');  // CREATED
                        sb.append(rs.getString(8)).append('\t');  // LAST_DDL_TIME
                        sb.append(rs.getString(9)).append('\t');  // TIMESTAMP
                        sb.append(rs.getString(10)).append('\t'); // STATUS
                        sb.append(rs.getString(11)).append('\t'); // TEMPORARY
                        sb.append(rs.getString(12)).append('\t'); // GENERATED
                        sb.append(rs.getString(13)).append('\t'); // SECONDARY
                        sb.append(rs.getString(14)).append('\t'); // NAMESPACE
                        sb.append(rs.getString(15));              // EDITION_NAME
                        queue.put(sb.toString());
                    }

                    rs.close();
                    stmt.close();
                    conn.close();
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    // Always signal the writer, even on failure, so it can't hang.
                    finished = true;
                }
            }
        };

        long startTime = System.currentTimeMillis();
        writeWorker.start();
        readerThread.start();
        System.out.println("Waiting for join..");
        writeWorker.join();
        System.out.println("Exit: " + (System.currentTimeMillis() - startTime));
    }
}

There are two threads: one fetches rows from the result set and the other writes the string values to the file. The measured loading speed was about 10 Mb/s, and in my case I need to make it 10 times faster. The profiler shows that the most time-consuming methods are

oracle.jdbc.driver.OracleResultSetImpl.getString()

and

oracle.net.ns.Packet.receive()

Do you have any ideas on how to make JDBC load the data much faster? Any ideas about query optimisation, string loading optimisation, tweaking the JDBC driver or using another one, using Oracle's JDBC implementation directly, or tweaking Oracle itself are appreciated.

UPDATE: I've compiled the discussion results below:

  1. I have no access to the DBMS server other than the connection to the Oracle DB, and the server can't connect to any external resource. Any dump or extraction utilities that use the server's or a remote file system can't be applied, and it's also impossible to install and use any external Java or PL/SQL routines on the server. Only a connection to execute queries; that's all.

  2. I used a profiler and dug into the Oracle JDBC driver. I found out that the most expensive operation is reading the data, i.e. Socket.read(). All string fields are represented as one char array and have almost no influence on performance. In general, I checked the whole app with the profiler, and Socket.read() is definitely the most expensive operation. Extracting fields, building strings, and writing data consume almost nothing. The problem is only in reading the data.

  3. Optimisations in the data representation on the server side have no real effect. Concatenating strings and converting timestamps make no difference to performance.

  4. The app was rewritten to have several reader threads that put ready data into the writer's queue. Each thread has its own connection; no pools are used because they slow the extraction down (I tried the UCP pool recommended by Oracle and it consumed about 10% of the execution time, so I gave up on it). The result set fetchSize was also increased, because switching from the default value (10) to 50000 gives up to 50% performance growth. A minimal sketch of this layout is shown after this list.

  5. I tested how the multithreaded version works with 4 reading threads and found out that increasing the reader count only slows the extraction down. I tried launching 2 instances, each with two readers, and both finished in the same time as a single instance, i.e. double the data extraction requires the same time as a single one. I don't know why this happens, but it looks like the Oracle driver has some performance limitation. An application with 4 independent connections works slower than 2 app instances with 2 connections each. (The profiler was used to confirm that the driver's Socket.read() is still the main issue; all other parts work fine in multithreaded mode.)

  6. I tried to fetch all the data with SAS, and it can perform the same extraction 2 times faster than JDBC, with both using a single connection to Oracle and no dump operations. Oracle claims that the JDBC thin driver is as fast as the native one.

Maybe Oracle has other ways to perform fast extraction to a remote host, through ODBC or something else?
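To make point 4 concrete, here is a minimal sketch of the multi-reader layout, assuming a hypothetical splitting predicate (MOD(object_id, n)); any expression that partitions the view evenly would do, and the writer thread is omitted:

import java.sql.*;
import java.util.concurrent.ArrayBlockingQueue;

// Sketch: N readers, each with its own connection, feeding one writer queue.
public class MultiReader {
    static final int READERS = 2;

    public static void main(String[] args) throws Exception {
        final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);
        Thread[] readers = new Thread[READERS];
        for (int i = 0; i < READERS; i++) {
            final int slice = i;
            readers[i] = new Thread("Reader-" + i) {
                public void run() {
                    try {
                        Connection conn = DriverManager.getConnection(
                                "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");
                        PreparedStatement stmt = conn.prepareStatement(
                                "select * from src_schema.big_table_view where mod(object_id, ?) = ?");
                        stmt.setFetchSize(50000); // 10 -> 50000 gave up to 50% growth here
                        stmt.setInt(1, READERS);
                        stmt.setInt(2, slice);
                        ResultSet rs = stmt.executeQuery();
                        while (rs.next()) {
                            queue.put(rs.getString(1)); // format the full row as in the original code
                        }
                        rs.close();
                        stmt.close();
                        conn.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
            readers[i].start();
        }
        for (Thread t : readers) {
            t.join();
        }
    }
}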

asked Aug 16 '14 by user3007501

2 Answers

Assuming you have already checked the basic network factors such as interfaces, firewalls, and proxies, as well as the DB server's hardware.

Option 1:

Instead of:

Class.forName("oracle.jdbc.OracleDriver");
Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");

try using:

import oracle.jdbc.pool.OracleDataSource;

OracleDataSource ods = new OracleDataSource();
java.util.Properties prop = new java.util.Properties();
prop.setProperty("MinLimit", "2");
prop.setProperty("MaxLimit", "10");
String url = "jdbc:oracle:oci8:@//xxx.xxx.xxx.xxx:1521/orcl"; // OCI (thick) driver
ods.setURL(url);
ods.setUser("USER");
ods.setPassword("PWD");
ods.setConnectionCachingEnabled(true);
ods.setConnectionCacheProperties(prop);
ods.setConnectionCacheName("ImplicitCache01");
Connection conn = ods.getConnection(); // connections are now served from the cache

More details here

Option 2: Fetch size

As Stephen strongly points out in his answer, the fetch size seems far too large.

Also, for a fetch size of 500,000, what are your -Xms and -Xmx? And in the profiler, what's the highest heap size?

Option 3: DB

  • Check the indexes and the query plan for src_schema.big_table_view

  • Is this a one-off tool or an application system? If it is just a tool, you could add parallel degrees, index hints, partitioning, etc., based on the DB system's capabilities (a one-line sketch follows)
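If hints are allowed on the view's base tables, a parallel degree can be requested straight from the JDBC client. A sketch, where the alias t and the degree of 4 are arbitrary examples:

// Hypothetical example: ask Oracle for a parallel scan of the view's base table.
ResultSet rs = stmt.executeQuery(
        "select /*+ parallel(t, 4) */ * from src_schema.big_table_view t");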

Option 4: Threads

Say n < the number of cores on the application server.

You can start n writer threads, each configured to process a certain bucket, e.g. thread 1 processes rows 0 to 10000, each writing to its own file; once all threads are done, post join, merge the files together, preferably using a low-level OS command.

That said, none of this should be hard-coded as it is now: 'n' and the buckets should be calculated at runtime, and creating more threads than your system supports only makes things worse. A merge sketch follows.
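As a rough illustration of the merge step, here is a sketch that picks n from the available cores at runtime and concatenates the per-thread part files with NIO; the file names (part-<i>.tsv, merged.tsv) are made up for the example:

import java.io.FileOutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: choose 'n' at runtime, then merge n part files into one output file.
public class Merge {
    public static void main(String[] args) throws Exception {
        int n = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        try (FileChannel out = new FileOutputStream("merged.tsv").getChannel()) {
            for (int i = 0; i < n; i++) {
                Path part = Paths.get("part-" + i + ".tsv");
                try (FileChannel in = FileChannel.open(part)) {
                    long size = in.size();
                    long pos = 0;
                    while (pos < size) { // transferTo may move fewer bytes than requested
                        pos += in.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }
}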

Option 5:

Instead of

select * from src_schema.big_table_view

You could use

SELECT column1||CHR(9)||column2||CHR(9).....||columnN FROM src_schema.big_table_view

This avoids creating 500,000 StringBuilders and Strings on the client (assuming no other complex formatting is involved). CHR(9) is the tab character.
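With a query like that, the client-side reading loop collapses to a single getString() per row. A minimal sketch, with the column list abbreviated and the queue from the question assumed:

// Each row arrives as one pre-formatted, tab-separated string.
ResultSet rs = stmt.executeQuery(
        "SELECT owner||CHR(9)||object_name||CHR(9)||object_id FROM src_schema.big_table_view");
while (rs.next()) {
    queue.put(rs.getString(1)); // hand the whole line to the writer queue
}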

Option 6:

In the meantime, you could also check with your DBA for any DB system issues and raise an SR with Oracle support.

answered Oct 09 '22 by Rajeev Sreedharan


It looks like you have already found and tweaked the row prefetch parameter. However, according to the Oracle documentation:

"There is no maximum prefetch setting, but empirical evidence suggests that 10 is effective. Oracle has never observed a performance benefit to setting prefetch higher than 50. If you do not set the default row-prefetch value for a connection, then 10 is the default."

You are setting it to 500000. Try winding it back to around 50, as Oracle recommends. (Why? It could be that a hugely excessive prefetch size causes the server or client to use excessive amounts of memory to buffer the prefetched data. That could have a knock-on effect on other things, leading to reduced throughput.)
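If you want to set the prefetch once per connection rather than per statement, the thin driver accepts it as the defaultRowPrefetch connection property. A minimal sketch:

java.util.Properties props = new java.util.Properties();
props.setProperty("user", "user");
props.setProperty("password", "pass");
// Oracle JDBC connection property; applies to every statement on this connection.
props.setProperty("defaultRowPrefetch", "50");
Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", props);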

Reference (from the Oracle 10g documentation):

  • Oracle Row-Prefetching Limitations

You might be able to get greater throughput by running simultaneous queries in multiple Java threads (e.g. on separate "sections" of the table), writing each result set to a separate stream / file. But then you have the problem of stitching the output streams / files together. (Whether you get an overall improvement will depend on the number of client- and server-side cores, network and NIC capacity, and disk I/O capacity.)

Apart from that, I can't think of any way to do this faster in Java. But you could try PL/SQL, or something lower level. (I'm not an Oracle expert. Talk to your DBAs.)

A factor of 10 speed up in Java is ... ambitious.

answered Oct 09 '22 by Stephen C