I have a simple Java application that can connect to and query my cluster via Hive or Impala, using code like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
...
Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("select * from foobar");
But now I want to run the same query using Spark SQL. I'm having a hard time figuring out how to use the Spark SQL API, though; specifically, how to set up the connection. I see examples of how to set up a Spark session, but it's unclear what values I need to provide. For example:
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();
How do I tell Spark SQL what host and port to use, what schema to use, and which authentication technique I'm using? For example, I'm using Kerberos to authenticate.
The above Spark SQL code is from https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java
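For context on what "connection" means here: with enableHiveSupport(), Spark does not take a JDBC-style URL at all; it reads the cluster coordinates from hive-site.xml (or from configuration entries you set on the builder), and the schema is selected with a "USE ..." statement. A minimal sketch of what those entries might look like, where the host name and port 9083 (the usual Hive metastore Thrift port) are placeholder assumptions, not values from your cluster:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparkHiveConfSketch {

    // Builds the configuration entries a Hive-enabled SparkSession needs.
    // The metastore URI replaces the host/port portion of the JDBC URL;
    // the schema is selected later with a "USE ..." statement.
    static Map<String, String> hiveConf(String metastoreHost) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.metastore.uris", "thrift://" + metastoreHost + ":9083");
        conf.put("spark.sql.catalogImplementation", "hive");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = hiveConf("myHostIP");
        // With the Spark jars on the classpath, this would become:
        //   SparkSession.Builder b = SparkSession.builder()
        //       .appName("Java Spark Hive Example")
        //       .enableHiveSupport();
        //   conf.forEach(b::config);
        //   SparkSession spark = b.getOrCreate();
        //   spark.sql("USE mySchemaName");
        //   spark.sql("select * from foobar").show();
        System.out.println(conf.get("hive.metastore.uris"));
    }
}
```

In practice, dropping the cluster's hive-site.xml onto the application classpath achieves the same thing without setting these keys by hand.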
UPDATE:
I was able to make a little progress, and I think I've figured out how to tell the Spark SQL connection what host and port to use.
...
SparkSession spark = SparkSession
  .builder()
  .master("spark://myHostIP:10000")
  .appName("Java Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate();
And I added the following dependency to my pom.xml file:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
With this update I can see that the connection is getting further, but it now appears to be failing because I'm not authenticated. I need to figure out how to authenticate using Kerberos. Here's the relevant log data:
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils : Successfully started service 'SparkUI' on port 4040.
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
2017-12-19 11:17:56.065 INFO 11912 --- [er-threadpool-0] s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
2017-12-19 11:17:56.260 INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
2017-12-19 11:17:56.354 WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler : Exception in connection from myHostIP:10000
java.io.IOException: An existing connection was forcibly closed by the remote host
You can use Kerberos integrated authentication to connect to SQL Server. Beginning in Microsoft JDBC Driver 4.0 for SQL Server, an application can use the authenticationScheme connection property to indicate that it wants to connect to a database using type 4 Kerberos integrated authentication.
In addition to allowing the driver to acquire Kerberos credentials using the settings specified in the login module configuration file, the driver can use existing credentials. This can be useful when your application needs to create connections using more than one user's credentials.
When using a datasource to create connections, you can programmatically set the authentication scheme using setAuthenticationScheme and (optionally) set the SPN for Kerberos connections using setServerSpn. A new logger has been added to support Kerberos authentication: com.microsoft.sqlserver.jdbc.internals.KerbAuthentication.
Beginning in Microsoft JDBC Driver 6.2, the driver supports Kerberos constrained delegation. The delegated credential can be passed in as an org.ietf.jgss.GSSCredential object; the driver uses these credentials to establish the connection.
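As a sketch of how the authenticationScheme property mentioned above is used, the following assembles a SQL Server JDBC connection URL; the host, port, and database names are placeholders, not values from this question's cluster:

```java
public class SqlServerKerberosUrlSketch {

    // Assembles a SQL Server JDBC URL that requests type 4 Kerberos
    // integrated authentication via the authenticationScheme property.
    static String kerberosUrl(String host, int port, String database) {
        return "jdbc:sqlserver://" + host + ":" + port
                + ";databaseName=" + database
                + ";integratedSecurity=true"
                + ";authenticationScheme=JavaKerberos";
    }

    public static void main(String[] args) {
        String url = kerberosUrl("myHostIP", 1433, "mySchemaName");
        // With mssql-jdbc on the classpath, this URL would be used as:
        //   Connection con = DriverManager.getConnection(url);
        System.out.println(url);
    }
}
```

When using a datasource instead, setAuthenticationScheme("JavaKerberos") achieves the same effect programmatically.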
You can try performing a Kerberos login before opening the connection:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.UserGroupInformation;
...
Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
// hdfs-site.xml and core-site.xml copied from the cluster
conf.addResource(pathToHdfsSite);
conf.addResource(pathToCoreSite);
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hadoop.rpc.protection", "privacy");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
// your code here
Here, ktUserName is the Kerberos principal, e.g. user@YOUR-REALM.COM. You need core-site.xml, hdfs-site.xml, and the keytab on your machine to run this.
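To tie this back to the Spark code above: once loginUserFromKeytab succeeds, the SparkSession can simply be created afterwards in the same JVM. A small sketch; the principal-format check is my own addition (not part of the Hadoop API), and the principal and keytab path are placeholders:

```java
import java.util.regex.Pattern;

public class KeytabLoginSketch {

    // A Kerberos principal has the form primary@REALM,
    // optionally primary/instance@REALM.
    private static final Pattern PRINCIPAL =
            Pattern.compile("[^@/\\s]+(/[^@\\s]+)?@[A-Za-z0-9.-]+");

    static boolean looksLikePrincipal(String ktUserName) {
        return PRINCIPAL.matcher(ktUserName).matches();
    }

    public static void main(String[] args) {
        String ktUserName = "user@YOUR-REALM.COM";                // placeholder
        String ktPath = "/etc/security/keytabs/user.keytab";      // placeholder
        if (!looksLikePrincipal(ktUserName)) {
            throw new IllegalArgumentException("not a principal: " + ktUserName);
        }
        // With the Hadoop and Spark jars on the classpath:
        //   UserGroupInformation.setConfiguration(conf);
        //   UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
        //   SparkSession spark = SparkSession.builder()
        //       .appName("Java Spark Hive Example")
        //       .enableHiveSupport()
        //       .getOrCreate();
        System.out.println(looksLikePrincipal(ktUserName));
    }
}
```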