I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to make it connect to Drill via JDBC:
Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);
connectionOptions.put("dbtable", args[1]);
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");
DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();
Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:
SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0
SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)
The first one is successful, but in the second one Spark encloses fields within double quotes, which is something that Drill doesn't support, so the query fails.
Did someone managed to get this integration working?
Thank you!
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables.
you can add JDBC Dialect for this and register the dialect before using jdbc connector
case object DrillDialect extends JdbcDialect {
def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")
override def quoteIdentifier(colName: java.lang.String): java.lang.String = {
return colName
}
def instance = this
}
JdbcDialects.registerDialect(DrillDialect)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With