 

Why does Spark append 'WHERE 1=0' to the end of my SQL query?

I am trying to execute a simple MySQL query using Apache Spark and create a DataFrame from the result. But for some reason Spark appends 'WHERE 1=0' to the end of the query I want to execute and throws an exception stating 'You have an error in your SQL syntax'.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("rddjoin").getOrCreate()
val mhost = "jdbc:mysql://localhost:3306/registry"
val mprop = new java.util.Properties
mprop.setProperty("driver", "com.mysql.jdbc.Driver")
mprop.setProperty("user", "root")
mprop.setProperty("password", "root")
val q = """select id from loaded_item"""
val res = spark.read.jdbc(mhost, q, mprop)
res.show(10)

The exception is as follows:

18/02/16 17:53:49 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'select id from loaded_item WHERE 1=0' at line 1
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.Util.getInstance(Util.java:408)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
    at GenerateReport$.main(GenerateReport.scala:46)
    at GenerateReport.main(GenerateReport.scala)
18/02/16 17:53:50 INFO SparkContext: Invoking stop() from shutdown hook
asked Feb 16 '18 by sam N

People also ask

What does WHERE 1=0 mean in SQL?

The condition 1=0 is always false, so it can be used to stop a query from returning any rows: the query returns an empty result set.
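A minimal Spark sketch of the idea (runnable in spark-shell; the numbers are illustrative): an always-false filter returns zero rows, yet the schema is fully preserved.

// An always-false predicate yields an empty result,
// but the DataFrame's schema is still fully known.
val empty = spark.range(5).where("1 = 0")
empty.show()          // prints an empty table
empty.printSchema()   // still reports the column: id (long)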

How do I escape a single quote in Spark SQL?

The simplest method to escape single quotes in SQL is to use two single quotes for every one quote you want to display. For example, if you wanted to show the value O'Reilly, you would use two quotes in the middle instead of one.
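As a hedged illustration (reusing mhost and mprop from the question; the alias t is arbitrary), the doubled quote inside the literal stands for one displayed quote:

// MySQL reads '' inside a string literal as a single quote character,
// so this subquery returns the single value O'Reilly.
val quoted = spark.read.jdbc(mhost, "(SELECT 'O''Reilly' AS author) AS t", mprop)
quoted.show()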

How does Spark SQL work?

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
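A short sketch of that DataFrame/SQL duality, reusing the connection values from the question (the view name is illustrative):

// DataFrame API: load the table as a DataFrame
val items = spark.read.jdbc(mhost, "registry.loaded_item", mprop)
// expose it to the SQL engine under a temporary name
items.createOrReplaceTempView("loaded_item")
// run a distributed SQL query over the same data
spark.sql("SELECT id FROM loaded_item").show()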

How many rows of data can Spark handle?

As a quick strength test of PySpark: on first testing, PySpark can perform joins and aggregation of 1.5 billion rows (roughly 1 TB of data) in 38 seconds, and of 130 billion rows (roughly 60 TB) in 21 minutes.


1 Answer

The second parameter of your call to spark.read.jdbc is not correct. Instead of specifying a bare SQL query, you should pass either a table name qualified with its schema or a valid SQL query wrapped in parentheses and given an alias. In your case this would be val q = "registry.loaded_item". Another option, if you want to provide additional parameters (for example a WHERE clause), is to use one of the other overloads of DataFrameReader.jdbc. Both variants are sketched below.
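A sketch of both working variants, keeping the connection values from your snippet (the alias items is arbitrary):

// Variant 1: pass a schema-qualified table name instead of a raw query.
val res1 = spark.read.jdbc(mhost, "registry.loaded_item", mprop)

// Variant 2: wrap an arbitrary query in parentheses and give it an alias;
// Spark then treats the whole subquery as the "table" to read from.
val res2 = spark.read.jdbc(mhost, "(select id from loaded_item) AS items", mprop)
res2.show(10)

If you need to push filters down per partition, there is also the overload jdbc(url, table, predicates: Array[String], connectionProperties), where each predicate becomes the WHERE clause of one partition's query.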

By the way, the reason you see the strange-looking query with WHERE 1=0 is that Spark tries to infer the schema of your DataFrame without loading any actual data. The predicate guarantees that the query never returns a single row, but the metadata of its (empty) result set still tells Spark the column names and types.
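For reference, the probe that fails here (JDBCRDD.resolveTable in the stack trace above) builds roughly the following statement; this is a simplified sketch, not the actual Spark source:

// Simplified sketch of Spark's schema probe:
val table = "registry.loaded_item"
val schemaQuery = s"SELECT * FROM $table WHERE 1=0"
// => SELECT * FROM registry.loaded_item WHERE 1=0          (valid, returns zero rows)
// With q = "select id from loaded_item" it instead becomes
// => SELECT * FROM select id from loaded_item WHERE 1=0    (the syntax error above)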

answered Nov 10 '22 by werner