Using SparkR JVM to call methods from a Scala jar file

I wanted to be able to package DataFrames in a Scala jar file and access them in R. The end goal is to create a way to access specific and often-used database tables in Python, R, and Scala without writing a different library for each.

To do this, I made a jar file in Scala with functions that use the SparkSQL library to query the database and get the DataFrames I want. I wanted to be able to call these functions in R without creating another JVM, since Spark already runs on a JVM in R. However, the JVM Spark uses is not exposed in the SparkR API. To make it accessible and make Java methods callable, I modified "backend.R", "generics.R", "DataFrame.R", and "NAMESPACE" in the SparkR package and rebuilt the package:

In "backend.R" I made "callJMethod" and "createJObject" formal methods:

  setMethod("callJMethod", signature(objId="jobj", methodName="character"), function(objId, methodName, ...) {
  stopifnot(class(objId) == "jobj")
  if (!isValidJobj(objId)) {
    stop("Invalid jobj ", objId$id,
         ". If SparkR was restarted, Spark operations need to be re-executed.")
  }
  invokeJava(isStatic = FALSE, objId$id, methodName, ...)
})


  setMethod("newJObject", signature(className="character"), function(className, ...) {
  invokeJava(isStatic = TRUE, className, methodName = "<init>", ...)
})

I modified "generics.R" to also contain these functions:

#' @rdname callJMethod
#' @export
setGeneric("callJMethod", function(objId, methodName, ...) { standardGeneric("callJMethod")})

#' @rdname newJObject
#' @export
setGeneric("newJObject", function(className, ...) {standardGeneric("newJObject")})

Then I added exports for these functions to the NAMESPACE file:

export("cacheTable",
   "clearCache",
   "createDataFrame",
   "createExternalTable",
   "dropTempTable",
   "jsonFile",
   "loadDF",
   "parquetFile",
   "read.df",
   "sql",
   "table",
   "tableNames",
   "tables",
   "uncacheTable",
   "callJMethod",
   "newJObject")

This allowed me to call the Scala functions I wrote without starting a new JVM.

The Scala methods I wrote return DataFrames, which arrive in R as "jobj"s, but a SparkR DataFrame is an environment plus a jobj. To turn these jobj DataFrames into SparkR DataFrames, I used the dataFrame() function in "DataFrame.R", which I also made accessible following the steps above.
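
Roughly, the usage then looks like the sketch below. The class and method names are placeholders for the real ones in my jar, and it assumes the rebuilt SparkR package is installed and the jar is passed to the JVM when SparkR starts:

  # Placeholder names: "com.example.tables.TableRegistry" and "userTable" stand in
  # for the real Scala object and method in my jar.
  library(SparkR)

  sc <- sparkR.init(sparkJars = "/path/to/my-tables.jar")
  sqlContext <- sparkRSQL.init(sc)

  # Construct the Scala object on the existing JVM and call one of its methods,
  # which returns a Scala DataFrame as a jobj.
  registry <- newJObject("com.example.tables.TableRegistry", sqlContext)
  userJobj <- callJMethod(registry, "userTable")

  # Wrap the returned jobj into a SparkR DataFrame.
  usersDF <- dataFrame(userJobj)
  head(usersDF)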

I was then able to access the DataFrame that I "built" in Scala from R and use all of SparkR's functions on it. Is there a better way to make such a cross-language library, or is there any reason the Spark JVM should not be public?

asked Oct 23 '15 by mfliu

1 Answer

any reason the Spark JVM should not be public?

Probably more than one. Spark developers make a serious effort to provide a stable public API. Low-level details of the implementation, including the way guest languages communicate with the JVM, are simply not part of the contract. They could be completely rewritten at any point without any negative impact on users. If you decide to depend on them anyway and there are backwards-incompatible changes, you're on your own.

Keeping internals private reduces the effort required to maintain and support the software. You simply don't have to bother with all the possible ways in which users can abuse them.

a better way to make such a cross-language library

It is hard to say without knowing more about your use case. I see at least three options:

  • For starters, R provides only weak access control mechanisms. If any part of the API is internal, you can always use the ::: function to access it. As smart people say:

    It is typically a design mistake to use ::: in your code since the corresponding object has probably been kept internal for a good reason.

    but one thing is for sure: it is much better than modifying the Spark source. As a bonus, it clearly marks the parts of your code which are particularly fragile and potentially unstable (sketched after this list).

  • If all you want is to create DataFrames, the simplest thing is to use raw SQL. It is clean, portable, requires no compilation or packaging, and simply works. Assuming you have a query string like the one below stored in a variable named q

    CREATE TEMPORARY TABLE foo
    USING org.apache.spark.sql.jdbc
    OPTIONS (
        url "jdbc:postgresql://localhost/test",
        dbtable "public.foo",
        driver "org.postgresql.Driver"
    )
    

    it can be used in R:

    sql(sqlContext, q)
    fooDF <- sql(sqlContext, "SELECT * FROM foo")
    

    Python:

    sqlContext.sql(q)
    fooDF = sqlContext.sql("SELECT * FROM foo")
    

    Scala:

    sqlContext.sql(q)
    val fooDF = sqlContext.sql("SELECT * FROM foo")
    

    or directly in Spark SQL.

  • Finally, you can use the Spark Data Sources API for consistent and supported cross-platform access (sketched after this list).
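
To illustrate the first option, a minimal sketch (the class and method names are hypothetical, and none of this is a stable API):

    # Stock SparkR already defines these helpers internally; ::: reaches them
    # without rebuilding the package. Class and method names are hypothetical.
    registry <- SparkR:::newJObject("com.example.tables.TableRegistry", sqlContext)
    userJobj <- SparkR:::callJMethod(registry, "userTable")
    usersDF  <- SparkR:::dataFrame(userJobj)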
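
To illustrate the third option: once a custom data source is packaged and its jar is on the classpath, every language loads it through the same generic entry point. A sketch in R, with a hypothetical source name and options:

    # "com.example.foo" is a hypothetical data source class; read.df passes the
    # extra named arguments through as data source options.
    fooDF <- read.df(sqlContext,
                     source = "com.example.foo",
                     url = "jdbc:postgresql://localhost/test",
                     dbtable = "public.foo")
    head(fooDF)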

Out of these three I would prefer raw SQL, followed by the Data Sources API for complex cases, and I would leave the internals as a last resort.

Edit (2016-08-04):

If you're interested in low-level access to the JVM, there is a relatively new package, rstudio/sparkapi, which exposes the internal SparkR RPC protocol. It is hard to predict how it will evolve, so use it at your own risk.

answered Sep 22 '22 by zero323