How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.

The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are just a Java/Scala port of this library: aws-glue-libs

The template Scala code generated by AWS:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}

And the build.sbt I have started putting together for a local build:

name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

The documentation for the AWS Glue Scala API seems to describe functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That might be possible, since the Glue Python library uses Py4J.

asked Mar 13 '18 10:03

James

2 Answers

Unfortunately, there are no libraries available for the Scala Glue API. I have already contacted Amazon support, and they are aware of the problem. However, they didn't provide an ETA for delivering the API jar.

answered Sep 28 '22 15:09

Natalia

This is now supported, as of a recent release from AWS:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
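As a rough sketch of how the build.sbt from the question might be extended: the page above describes a Maven repository hosting an AWSGlueETL artifact, and the fragment below adds that repository and dependency in sbt form. The version "0.9.0" is an assumption, chosen only because it appears to correspond to the Spark 2.2.1 / Scala 2.11 combination used in the question; check the page for the version that matches your Glue release before relying on it.

```scala
// build.sbt (sketch only; verify the versions against the AWS documentation)
name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

// Repository that hosts the AWS Glue ETL artifacts, as listed on the AWS page.
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

// Spark version carried over from the question's build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

// The artifact name has no Scala-version suffix, so use % rather than %%.
// ASSUMPTION: "0.9.0" is a placeholder for the Glue version you target.
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "0.9.0"
```

With this in place, the imports in the GlueApp template (com.amazonaws.services.glue.GlueContext and so on) should resolve in the IDE. Reading from the Glue Data Catalog locally would still need AWS credentials and network access.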

answered Sep 28 '22 14:09

sri hari kali charan Tummala

Tags: scala, sbt, pyspark, aws-glue
