I have a simple Java application that can connect to and query my cluster via Hive or Impala, using code like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
...
Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("select * from foobar");
But now I want to run the same query using Spark SQL. I'm having a hard time figuring out how to use the Spark SQL API, though; specifically, how to set up the connection. I see examples of how to set up a Spark session, but it's unclear what values I need to provide. For example:
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();
How do I tell Spark SQL what host and port to use, what schema to use, and which authentication technique I'm using? For example, I'm using Kerberos to authenticate.
The above Spark SQL code is from https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java
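For context on what "connection" means here: with enableHiveSupport(), Spark does not take a JDBC-style URL at all; it reads the cluster coordinates from hive-site.xml (or from configuration entries you set on the builder), and the schema is selected with a "USE ..." statement. A minimal sketch of what those entries might look like, where the host name and port 9083 (the usual Hive metastore Thrift port) are placeholder assumptions, not values from your cluster:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparkHiveConfSketch {

    // Builds the configuration entries a Hive-enabled SparkSession needs.
    // The metastore URI replaces the host/port portion of the JDBC URL;
    // the schema is selected later with a "USE ..." statement.
    static Map<String, String> hiveConf(String metastoreHost) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.metastore.uris", "thrift://" + metastoreHost + ":9083");
        conf.put("spark.sql.catalogImplementation", "hive");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = hiveConf("myHostIP");
        // With the Spark jars on the classpath, this would become:
        //   SparkSession.Builder b = SparkSession.builder()
        //       .appName("Java Spark Hive Example")
        //       .enableHiveSupport();
        //   conf.forEach(b::config);
        //   SparkSession spark = b.getOrCreate();
        //   spark.sql("USE mySchemaName");
        //   spark.sql("select * from foobar").show();
        System.out.println(conf.get("hive.metastore.uris"));
    }
}
```

In practice, dropping the cluster's hive-site.xml onto the application classpath achieves the same thing without setting these keys by hand.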
UPDATE:
I was able to make a little progress, and I think I've figured out how to tell the Spark SQL connection what host and port to use.
...
SparkSession spark = SparkSession
  .builder()
  .master("spark://myHostIP:10000")
  .appName("Java Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate();
And I added the following dependency to my pom.xml file:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
With this update I can see that the connection is getting further, but it now appears to be failing because I'm not authenticated. I need to figure out how to authenticate using Kerberos. Here's the relevant log data:
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils : Successfully started service 'SparkUI' on port 4040.
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
2017-12-19 11:17:56.065 INFO 11912 --- [er-threadpool-0] s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
2017-12-19 11:17:56.260 INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
2017-12-19 11:17:56.354 WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler : Exception in connection from myHostIP:10000
java.io.IOException: An existing connection was forcibly closed by the remote host
You can use Kerberos integrated authentication to connect to SQL Server. Beginning in Microsoft JDBC Driver 4.0 for SQL Server, an application can use the authenticationScheme connection property to indicate that it wants to connect to a database using type 4 Kerberos integrated authentication.
In addition to allowing the driver to acquire Kerberos credentials using the settings specified in the login module configuration file, the driver can use existing credentials. This can be useful when your application needs to create connections using more than one user's credentials.
When using a datasource to create connections, you can programmatically set the authentication scheme using setAuthenticationScheme and (optionally) set the SPN for Kerberos connections using setServerSpn. A new logger has been added to support Kerberos authentication: com.microsoft.sqlserver.jdbc.internals.KerbAuthentication.
Beginning in Microsoft JDBC Driver 6.2, the driver supports Kerberos constrained delegation. The delegated credential can be passed in as an org.ietf.jgss.GSSCredential object; the driver uses these credentials to establish the connection.
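As a sketch of how the authenticationScheme property mentioned above is used, the following assembles a SQL Server JDBC connection URL; the host, port, and database names are placeholders, not values from this question's cluster:

```java
public class SqlServerKerberosUrlSketch {

    // Assembles a SQL Server JDBC URL that requests type 4 Kerberos
    // integrated authentication via the authenticationScheme property.
    static String kerberosUrl(String host, int port, String database) {
        return "jdbc:sqlserver://" + host + ":" + port
                + ";databaseName=" + database
                + ";integratedSecurity=true"
                + ";authenticationScheme=JavaKerberos";
    }

    public static void main(String[] args) {
        String url = kerberosUrl("myHostIP", 1433, "mySchemaName");
        // With mssql-jdbc on the classpath, this URL would be used as:
        //   Connection con = DriverManager.getConnection(url);
        System.out.println(url);
    }
}
```

When using a datasource instead, setAuthenticationScheme("JavaKerberos") achieves the same effect programmatically.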
You can try performing a Kerberos login before opening the connection:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.UserGroupInformation;
...
Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
// hdfs-site.xml and core-site.xml copied from the cluster
conf.addResource(pathToHdfsSite);
conf.addResource(pathToCoreSite);
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hadoop.rpc.protection", "privacy");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
// your code here
Here, ktUserName is the Kerberos principal, e.g. user@YOUR-REALM.COM. You need core-site.xml, hdfs-site.xml, and the keytab on your machine to run this.
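To tie this back to the Spark code above: once loginUserFromKeytab succeeds, the SparkSession can simply be created afterwards in the same JVM. A small sketch; the principal-format check is my own addition (not part of the Hadoop API), and the principal and keytab path are placeholders:

```java
import java.util.regex.Pattern;

public class KeytabLoginSketch {

    // A Kerberos principal has the form primary@REALM,
    // optionally primary/instance@REALM.
    private static final Pattern PRINCIPAL =
            Pattern.compile("[^@/\\s]+(/[^@\\s]+)?@[A-Za-z0-9.-]+");

    static boolean looksLikePrincipal(String ktUserName) {
        return PRINCIPAL.matcher(ktUserName).matches();
    }

    public static void main(String[] args) {
        String ktUserName = "user@YOUR-REALM.COM";                // placeholder
        String ktPath = "/etc/security/keytabs/user.keytab";      // placeholder
        if (!looksLikePrincipal(ktUserName)) {
            throw new IllegalArgumentException("not a principal: " + ktUserName);
        }
        // With the Hadoop and Spark jars on the classpath:
        //   UserGroupInformation.setConfiguration(conf);
        //   UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
        //   SparkSession spark = SparkSession.builder()
        //       .appName("Java Spark Hive Example")
        //       .enableHiveSupport()
        //       .getOrCreate();
        System.out.println(looksLikePrincipal(ktUserName));
    }
}
```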