Spark JDBC fetchsize option

I currently have an application which is supposed to connect to different types of databases, run a specific query on that database using Spark's JDBC options and then write the resultant DataFrame to HDFS.

Performance was extremely bad for Oracle (I didn't check the others). It turned out to be because of the fetchSize property, which defaults to 10 rows for Oracle. I increased it to 1000 and the performance gain was quite visible. Then I changed it to 10000, but some of the tables started failing with an out-of-memory error in the executor (6 executors with 4G memory each, 2G driver memory).

My questions are:

1. Is the data fetched by Spark's JDBC persisted in executor memory for each run? Is there any way to un-persist it while the job is running?
2. Where can I get more information about the fetchSize property? I'm guessing it won't be supported by all JDBC drivers.
3. Is there anything else related to JDBC that I need to take care of to avoid OOM errors?

asked Sep 15 '17 16:09

philantrovert

2 Answers

Fetch size is just a value set on the JDBC PreparedStatement.

You can see it in JDBCRDD.scala:

 stmt.setFetchSize(options.fetchSize)

You can read more about JDBC FetchSize here

One thing you can also improve is to set all 4 partitioning parameters (partitionColumn, lowerBound, upperBound, numPartitions), which makes Spark read in parallel. See more here. The read is then split across many machines, so the memory usage on each of them may be smaller.

For details on which JDBC options are supported and how, check your driver's documentation; every driver may have its own behaviour.
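As a rough illustration of setting the four partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) together with fetchsize, here is a minimal Python sketch. The helper name `partitioned_jdbc_options`, the JDBC URL, the table, the column, and the bounds are all hypothetical placeholders, not taken from the question. The option keys themselves are the ones documented for Spark's JDBC data source.

```python
def partitioned_jdbc_options(url, table, partition_column,
                             lower_bound, upper_bound,
                             num_partitions, fetch_size=1000):
    """Build the option map for a partitioned Spark JDBC read.

    Spark expects option values as strings, so numbers are converted.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,  # numeric, date or timestamp column
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),         # rows per round trip
    }


# Hypothetical connection details, for illustration only.
opts = partitioned_jdbc_options(
    url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
    table="SALES.ORDERS",
    partition_column="ORDER_ID",
    lower_bound=1,
    upper_bound=6_000_000,
    num_partitions=6,
)

# With a live SparkSession named `spark` (not created in this sketch):
# df = spark.read.format("jdbc").options(**opts).load()
# df.write.parquet("hdfs:///tmp/orders")
```

Note that lowerBound and upperBound only decide how the partition ranges are cut; they do not filter rows, so values outside the range still end up in the first or last partition. Credentials (user, password) would also need to be added to the options for a real database.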

answered Oct 21 '22 02:10

T. Gawęda

To answer @y2k-shubham's follow-up question, "do I pass it inside connectionProperties param": per the current docs the answer is "Yes", but note the lower-case 's'.

fetchsize: The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.
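For what it's worth, passing it through the connection properties could look roughly like the sketch below. The helper `connection_properties`, the user name, the password, the URL, and the table are made-up placeholders. The only point is that the key is spelled `fetchsize`, as in the docs quoted above.

```python
def connection_properties(user, password, fetch_size):
    """Build the properties dict passed to spark.read.jdbc(...).

    The key is spelled 'fetchsize' (lower-case 's'), as in the Spark docs.
    """
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    return {
        "user": user,
        "password": password,
        "fetchsize": str(fetch_size),  # JDBC properties are strings
    }


# Hypothetical credentials, for illustration only.
props = connection_properties("app_user", "secret", 1000)

# With a live SparkSession named `spark` (not created in this sketch):
# df = spark.read.jdbc(
#     url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
#     table="SALES.ORDERS",
#     properties=props,
# )
```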

answered Oct 21 '22 04:10

Ion Freeman

Tags: jdbc, apache-spark, apache-spark-sql
