I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL.
I've tried several Spark clusters - YARN, a standalone cluster, and pseudo-distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs both in spark-shell
and when executing the code with spark-submit
. I've tried the MySQL and MS SQL JDBC drivers without success.
Consider the following sample:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}
So far so good, the schema gets resolved properly:
t1: org.apache.spark.sql.DataFrame = [id: int, name: string]
But when I evaluate the DataFrame:
t1.take(1)
the following exception occurs:
15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I try to open a JDBC connection on an executor:
import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()
it works perfectly:
res1: Array[Int] = Array(0, 1)
When I run the same code on local Spark, it works perfectly too:
scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])
I'm using Spark pre-built with Hadoop 2.4 support.
The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh
script and run the following command:
/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar
Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't help here.
I am using Spark 1.6.1 with SQL Server and still faced the same issue. I had to add the library (sqljdbc-4.0.jar) to lib on the instance, and the line below to the conf/spark-defaults.conf
file.
spark.driver.extraClassPath lib/sqljdbc-4.0.jar
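If the executors also fail to see the driver, a matching executor-side entry can be added to the same conf/spark-defaults.conf file (the jar path below mirrors the one above and is an assumption about your layout):

spark.executor.extraClassPath lib/sqljdbc-4.0.jar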
Apparently this issue has been recently reported:
https://issues.apache.org/jira/browse/SPARK-6913
The problem is in java.sql.DriverManager, which doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader.
As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors.
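For example (a sketch, not a verified fix): the driver jar can be appended to the executor JVMs' boot classpath through spark.executor.extraJavaOptions, assuming the jar exists at the same path on every worker node:

/path/to/spark-shell --master spark://<hostname>:7077 --driver-class-path /path/to/mysql-connector-java-5.1.35.jar --conf "spark.executor.extraJavaOptions=-Xbootclasspath/a:/path/to/mysql-connector-java-5.1.35.jar"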
UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782
UPDATE 2: The fix has been merged into Spark 1.4.
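On Spark 1.4+ the same table can also be loaded through the DataFrameReader API; a minimal sketch, reusing the connection details from the question:

val t1 = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "t1")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "50")
  .load()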
For writing data to MySQL
In Spark 1.4.0, you have to load from MySQL before writing to it, because Spark loads the JDBC driver in the load function but not in the write function. You also have to put the jar on every worker node and set its path in the spark-defaults.conf file on each node. This issue has been fixed in Spark 1.5.0:
https://issues.apache.org/jira/browse/SPARK-10036
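For reference, a minimal sketch of a write on Spark 1.5.0+ (where the write path loads the driver); the URL, table name, and credentials are placeholder assumptions:

import java.util.Properties

val props = new Properties()
props.setProperty("driver", "com.mysql.jdbc.Driver")
props.setProperty("user", "user")        // assumption: replace with real credentials
props.setProperty("password", "secret")  // assumption

// df is an existing DataFrame; "t1" is the target table
df.write.jdbc("jdbc:mysql://localhost:3306/test", "t1", props)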