I am trying to execute a simple mysql query using Apache Spark and create a data frame. But for some reasons spark appends 'WHERE 1=0'
at the end of the query which I want to execute and throws an exception stating 'You have an error in your SQL syntax'
.
val spark = SparkSession.builder.master("local[*]").appName("rddjoin"). getOrCreate()
val mhost = "jdbc:mysql://localhost:3306/registry"
val mprop = new java.util.Properties
mprop.setProperty("driver", "com.mysql.jdbc.Driver")mprop.setProperty("user", "root")
mprop.setProperty("password", "root")
val q= """select id from loaded_item"""
val res=spark.read.jdbc(mhost,q,mprop)
res.show(10)
And the exception is as below:
18/02/16 17:53:49 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'select id from loaded_item WHERE 1=0' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.Util.getInstance(Util.java:408)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
at GenerateReport$.main(GenerateReport.scala:46)
at GenerateReport.main(GenerateReport.scala)
18/02/16 17:53:50 INFO SparkContext: Invoking stop() from shutdown hook
The condition 1=0 can be used to stop the query from returning any rows. It returns empty set.
Use Two Single Quotes For Every One Quote To Display The simplest method to escape single quotes in SQL is to use two single quotes. For example, if you wanted to show the value O'Reilly, you would use two quotes in the middle instead of one.
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
Let's do a quick strength testing of PySpark before moving forward so as not to face issues with increasing data size, On first testing, PySpark can perform joins and aggregation of 1.5Bn rows i.e ~1TB data in 38secs and 130Bn rows i.e ~60 TB data in 21 Mins.
The second parameter of your call to spark.read.jdbc
is not correct. Instead of specifing a sql query, you should either use a table name qualified with schema or a valid SQL query with an alias. In your case this would be val q="registry.loaded_item"
. Another option if you want to provide addional parameters (maybe for a where statement) is to use the other versions of DataframeReader.jdbc.
Btw: the reason why you see the strange looking query WHERE 1=0
is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed never to deliver any results, but the query result's metadata can be used by Spark to get the schema information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With