We have found that loading data from Oracle databases through Spark's JDBC API has always been slow, from Spark 1.3 up to the current Spark 2.0.1. The typical code, in Java, looks like this:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// JDBC options for the Oracle source
Map<String, String> options = new HashMap<String, String>();
options.put("url", ORACLE_CONNECTION_URL);
options.put("dbtable", dbTable);
options.put("batchsize", "100000");
options.put("driver", "oracle.jdbc.OracleDriver");

Dataset<Row> jdbcDF = sparkSession.read().options(options)
        .format("jdbc")
        .load().cache();
jdbcDF.createTempView("my");

jdbcDF.printSchema();
jdbcDF.show();
System.out.println(jdbcDF.count());
One of our team members once customized this part and improved it a lot at the time (Spark 1.3.0), but some of the Spark core code he relied on later became internal to Spark, so his approach cannot be used after that version. We also see that Hadoop's Sqoop is much faster than Spark for this kind of load (though it writes to HDFS, which needs a lot of extra work to be converted into a Dataset for Spark to use). Writing to Oracle using Spark's Dataset write method, on the other hand, performs well for us. It is puzzling why this happens!
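Note that Spark's JDBC source reads over a single connection unless partitioning options are supplied, which alone can explain a slow load. A minimal sketch of a partitioned read, where the column ID, its bounds, and the partition count are placeholders that assume a numeric, roughly evenly distributed key:
// Partitioning options split the read across numPartitions parallel queries;
// "ID", the bounds, and the partition count are placeholder values
options.put("partitionColumn", "ID");
options.put("lowerBound", "1");
options.put("upperBound", "10000000");
options.put("numPartitions", "16");
Dataset<Row> partitionedDF = sparkSession.read().options(options)
        .format("jdbc")
        .load();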
Hope that helps! While Sqoop is easier to use out of the box, the fact that it is based on MapReduce likely means Spark will be superior in some scenarios, and Spark should be your go-to option when you want to save the data as Parquet or ORC (neither of which Sqoop supports).
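For example, once the Dataset is loaded, saving it as Parquet is a one-liner (the output path below is a placeholder):
// Write the loaded Dataset as Parquet; the path is a placeholder
jdbcDF.write().mode("overwrite").parquet("hdfs:///tmp/my_table_parquet");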
Spark uses 200 shuffle partitions by default for transformations that shuffle data (the spark.sql.shuffle.partitions setting). 200 partitions may be far too many when working with small data, and the per-task overhead can slow the query down.
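A minimal sketch of lowering that setting (the value 8 is an arbitrary example; tune it to your data size):
// spark.sql.shuffle.partitions controls the partition count after shuffles
sparkSession.conf().set("spark.sql.shuffle.partitions", "8");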
There is no performance difference whatsoever: both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference. Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
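To illustrate, the following two queries are equivalent and compile to the same physical plan (this assumes the temp view "my" registered in the question; the column id is a placeholder):
// SQL variant against the registered temp view
Dataset<Row> viaSql = sparkSession.sql("SELECT * FROM my WHERE id > 100");
// DataFrame API variant; Catalyst produces the same plan for both
Dataset<Row> viaApi = sparkSession.table("my").filter("id > 100");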
Sometimes, Spark runs slowly because there are too many concurrent tasks running. The capacity for high concurrency is beneficial in itself, since it gives Spark fine-grained resource sharing, maximizing utilization while cutting query latencies, but past a point the scheduling overhead of the extra tasks outweighs the gain.
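If the task count itself is the bottleneck, one way to rein it in after a load is to coalesce the Dataset; a minimal sketch (the target of 16 partitions is an arbitrary example):
// Reduce the number of partitions, and hence concurrent tasks,
// without triggering a full shuffle; 16 is an arbitrary example
Dataset<Row> fewerPartitions = jdbcDF.coalesce(16);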
Well, @Pau Z Wu already answered the question in the comments, but the problem was options.put("batchsize", "100000");
This needed to be options.put("fetchsize", "100000");
since fetchsize controls how many rows are retrieved from the database per round trip, so raising it cuts the number of round trips and makes the load faster. (batchsize, by contrast, only applies to JDBC writes.)
More information can be found here: https://docs.oracle.com/cd/A87860_01/doc/java.817/a83724/resltse5.htm
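Applied to the question's code, the fix is a one-line change:
// fetchsize (not batchsize) is the read option Spark passes through
// as the JDBC row-prefetch count for each round trip
options.put("fetchsize", "100000");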