I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala).
Everything works fine: I can connect, debug, and inspect code locally in the IDE.
I created a Databricks Job to run my custom app JAR, but it fails with the following exception:
19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$
at Main$.<init>(Main.scala:30)
at Main$.<clinit>(Main.scala)
Line 30 of my Main.scala is:
val dbutils: DBUtils.type = com.databricks.service.DBUtils
just as described on this documentation page.
That page shows a way to access DBUtils that works both locally and in the cluster, but the example only shows Python, and I'm using Scala.
What's the proper way to access it in a way that works both locally using databricks-connect and in a Databricks Job running a JAR?
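For reference, here is a simplified version of the relevant part of my Main (the real object does more; this is just the access pattern that fails, and the path is a placeholder):

import com.databricks.service.DBUtils

object Main {
  // This is the line that fails on the cluster with NoClassDefFoundError
  val dbutils: DBUtils.type = com.databricks.service.DBUtils

  def main(args: Array[String]): Unit = {
    // Typical usage; works fine locally through databricks-connect
    dbutils.fs.ls("dbfs:/tmp").foreach(println)
  }
}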
UPDATE
It seems there are two ways of using DBUtils.
1) The DbUtils class described here. Quoting the docs, this library allows you to build and compile the project, but not run it. It doesn't let you run your local code on the cluster.
2) Databricks Connect, described here. This one allows you to run your local Spark code in a Databricks cluster.
The problem is that these two methods have different setups and package names. There doesn't seem to be a way to use Databricks Connect locally (it's not available in the cluster) while also having the JAR application use the DbUtils class added via sbt/maven so that the cluster has access to it.
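If I'm reading the docs right, these are the two different access points (same functionality, different packages):

// 1) the DbUtils class from the dbutils-api library (compile-only dependency)
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// 2) Databricks Connect
import com.databricks.service.DBUtils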
I don't know why the docs you mentioned don't work. Maybe you're using a different dependency?
These docs have an example application you can download. It's a project with a very minimal test, so it doesn't create jobs or try to run them on the cluster, but it's a start. Also, please note that it uses the older 0.0.1 version of dbutils-api.
So to fix your current issue, instead of using com.databricks.service.DBUtils, try importing the dbutils from a different place:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
Or, if you prefer:
import com.databricks.dbutils_v1.{DBUtilsV1, DBUtilsHolder}
type DBUtils = DBUtilsV1
val dbutils: DBUtils = DBUtilsHolder.dbutils
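With that in place, the calls should look the same as in a notebook. A minimal sketch, assuming a DBFS path and a secret scope that exist in your workspace (both are placeholders here):

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object DbfsSmokeTest {
  def main(args: Array[String]): Unit = {
    // List a DBFS directory and print the paths (placeholder path)
    dbutils.fs.ls("dbfs:/tmp").foreach(info => println(info.path))

    // Read a secret (placeholder scope and key)
    val secret = dbutils.secrets.get("my-scope", "my-key")
    println(s"secret length: ${secret.length}")
  }
}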
Also, make sure that you have the following dependency in SBT (maybe try to play with versions if 0.0.3 doesn't work; the latest one is 0.0.4):
libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.3"
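For context, here's what a minimal build.sbt could look like. The Scala/Spark versions and the "provided" scoping are assumptions based on a typical JAR-job setup, so adjust them to match your cluster:

// build.sbt -- minimal sketch, versions are assumptions
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark is supplied by the cluster at runtime, so "provided" is typical for JAR jobs
  "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided",
  // dbutils-api provides the com.databricks.dbutils_v1 classes at compile time;
  // the cluster ships its own implementation of them
  "com.databricks" % "dbutils-api_2.11" % "0.0.3" % "provided"
)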
This question and answer pointed me in the right direction. The answer contains a link to a working GitHub repo which uses dbutils: waimak. I hope this repo can help you with further questions about Databricks config and dependencies.
Good luck!
UPDATE
I see. So we have two similar but not identical APIs, and no good way to switch between the local and the backend version (though Databricks Connect promises that it should just work). Let me propose a workaround.
Fortunately, Scala is convenient for writing adapters. Here's a code snippet that acts as a bridge: the DBUtils object defined below provides a sufficient abstraction over the two versions of the API, the Databricks Connect one at com.databricks.service.DBUtils and the backend com.databricks.dbutils_v1.DBUtilsHolder.dbutils API. We achieve this by loading and using com.databricks.service.DBUtils through reflection, so there is no hard-coded import of it.
package com.example.my.proxy.adapter

import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.catalyst.DefinedByConstructorParams

import scala.util.Try
import scala.language.implicitConversions
import scala.language.reflectiveCalls

trait DBUtilsApi {
  type FSUtils
  type FileInfo
  type SecretUtils
  type SecretMetadata
  type SecretScope

  val fs: FSUtils
  val secrets: SecretUtils
}

trait DBUtils extends DBUtilsApi {
  trait FSUtils {
    def dbfs: org.apache.hadoop.fs.FileSystem
    def ls(dir: String): Seq[FileInfo]
    def rm(dir: String, recurse: Boolean = false): Boolean
    def mkdirs(dir: String): Boolean
    def cp(from: String, to: String, recurse: Boolean = false): Boolean
    def mv(from: String, to: String, recurse: Boolean = false): Boolean
    def head(file: String, maxBytes: Int = 65536): String
    def put(file: String, contents: String, overwrite: Boolean = false): Boolean
  }

  case class FileInfo(path: String, name: String, size: Long)

  trait SecretUtils {
    def get(scope: String, key: String): String
    def getBytes(scope: String, key: String): Array[Byte]
    def list(scope: String): Seq[SecretMetadata]
    def listScopes(): Seq[SecretScope]
  }

  case class SecretMetadata(key: String) extends DefinedByConstructorParams
  case class SecretScope(name: String) extends DefinedByConstructorParams
}

object DBUtils extends DBUtils {
  import Adapters._

  override lazy val (fs, secrets): (FSUtils, SecretUtils) = Try[(FSUtils, SecretUtils)](
    (ReflectiveDBUtils.fs, ReflectiveDBUtils.secrets) // try to use the Databricks Connect API
  ).getOrElse(
    (BackendDBUtils.fs, BackendDBUtils.secrets) // if it's not available, use com.databricks.dbutils_v1.DBUtilsHolder
  )

  private object Adapters {
    // The apparent code copying here is for performance -- the ones for `ReflectiveDBUtils` use reflection, while
    // the `BackendDBUtils` call the functions directly.

    implicit class FSUtilsFromBackend(underlying: BackendDBUtils.FSUtils) extends FSUtils {
      override def dbfs: FileSystem = underlying.dbfs
      override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
      override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
      override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
      override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
      override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
      override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
      override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
    }

    implicit class FSUtilsFromReflective(underlying: ReflectiveDBUtils.FSUtils) extends FSUtils {
      override def dbfs: FileSystem = underlying.dbfs
      override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
      override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
      override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
      override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
      override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
      override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
      override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
    }

    implicit class SecretUtilsFromBackend(underlying: BackendDBUtils.SecretUtils) extends SecretUtils {
      override def get(scope: String, key: String): String = underlying.get(scope, key)
      override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
      override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
      override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
    }

    implicit class SecretUtilsFromReflective(underlying: ReflectiveDBUtils.SecretUtils) extends SecretUtils {
      override def get(scope: String, key: String): String = underlying.get(scope, key)
      override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
      override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
      override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
    }
  }
}

object BackendDBUtils extends DBUtilsApi {
  import com.databricks.dbutils_v1

  private lazy val dbutils: DBUtils = dbutils_v1.DBUtilsHolder.dbutils

  override lazy val fs: FSUtils = dbutils.fs
  override lazy val secrets: SecretUtils = dbutils.secrets

  type DBUtils = dbutils_v1.DBUtilsV1
  type FSUtils = dbutils_v1.DbfsUtils
  type FileInfo = com.databricks.backend.daemon.dbutils.FileInfo
  type SecretUtils = dbutils_v1.SecretUtils
  type SecretMetadata = dbutils_v1.SecretMetadata
  type SecretScope = dbutils_v1.SecretScope
}
object ReflectiveDBUtils extends DBUtilsApi {
  // This throws a ClassNotFoundException when the Databricks Connect API isn't available -- much better than
  // the NoClassDefFoundError we would get with a hard-coded import of com.databricks.service.DBUtils.
  // As we're just using reflection, we're able to recover if it's not found.
  private lazy val dbutils: DBUtils =
    Class.forName("com.databricks.service.DBUtils$").getField("MODULE$").get(null).asInstanceOf[DBUtils]
  override lazy val fs: FSUtils = dbutils.fs
  override lazy val secrets: SecretUtils = dbutils.secrets

  type DBUtils = AnyRef {
    val fs: FSUtils
    val secrets: SecretUtils
  }

  type FSUtils = AnyRef {
    def dbfs: org.apache.hadoop.fs.FileSystem
    def ls(dir: String): Seq[FileInfo]
    def rm(dir: String, recurse: Boolean): Boolean
    def mkdirs(dir: String): Boolean
    def cp(from: String, to: String, recurse: Boolean): Boolean
    def mv(from: String, to: String, recurse: Boolean): Boolean
    def head(file: String, maxBytes: Int): String
    def put(file: String, contents: String, overwrite: Boolean): Boolean
  }

  type FileInfo = AnyRef {
    val path: String
    val name: String
    val size: Long
  }

  type SecretUtils = AnyRef {
    def get(scope: String, key: String): String
    def getBytes(scope: String, key: String): Array[Byte]
    def list(scope: String): Seq[SecretMetadata]
    def listScopes(): Seq[SecretScope]
  }

  type SecretMetadata = DefinedByConstructorParams { val key: String }
  type SecretScope = DefinedByConstructorParams { val name: String }
}
If you replace the val dbutils: DBUtils.type = com.databricks.service.DBUtils which you mentioned in your Main with val dbutils: DBUtils.type = com.example.my.proxy.adapter.DBUtils, everything should work as a drop-in replacement, both locally and remotely.
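Concretely, the Main could then look roughly like this (a sketch, reusing the placeholder package name from the adapter above; the path and scope are placeholders too):

import com.example.my.proxy.adapter.DBUtils

object Main {
  // Drop-in replacement for com.databricks.service.DBUtils
  val dbutils: DBUtils.type = DBUtils

  def main(args: Array[String]): Unit = {
    // Resolves to Databricks Connect locally and to dbutils_v1 on the cluster
    dbutils.fs.ls("dbfs:/tmp").foreach(info => println(info.path))
    dbutils.secrets.list("my-scope").foreach(meta => println(meta.key))
  }
}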
If you have some new NoClassDefFoundErrors, try adding specific dependencies to the JAR job, or maybe try rearranging them, changing the versions, or marking the dependencies as provided.
This adapter isn't pretty, and it uses reflection, but it should be good enough as a workaround, I hope. Good luck :)