Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Could not instantiate EventHubSourceProvider for Azure Databricks

Using the steps documented in structured streaming pyspark, I'm unable to create a dataframe in pyspark from the Azure Event Hub I have set up in order to read the stream data.

Error message is: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated

I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable) but none appear to work: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6

As well as ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString) but the error message returned is:

java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V

The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.

Can someone point me in the right direction, please. Code in use is as follows:

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"

# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)

ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString

# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()
like image 508
KMM Avatar asked May 17 '20 20:05

KMM


2 Answers

To resolve the issue, I did the following:

  1. Uninstall azure event hub library versions
  2. Install com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central
  3. Restart cluster
  4. Validate by re-running code provided in the question
like image 189
KMM Avatar answered Jan 04 '23 01:01

KMM


I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12

For anyone else finding this via google - check if you have the correct Scala library version. In my case, my cluster is Spark v3 with Scala 2.12

Changing the "2.11" in the library version from the tutorial I was using to "2.12", so it matches my cluster runtime version, fixed the issue.

like image 31
Joon Avatar answered Jan 04 '23 02:01

Joon