
Spark SQL DataFrame - import sqlContext.implicits._

I have a main that creates the Spark context:

    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

Then it creates a DataFrame and does filters and validations on it.

    val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

    val df = sqlContext.read.schema(struct).format("com.databricks.spark.csv").load(args(0))
      // drop rows with fewer than 3 non-null values
      .na.drop(3)
      // round times down to the hour
      .withColumn("time", convertToHourly($"time"))

This works great.

BUT when I try moving my validations to another file, by sending the DataFrame to

    def ValidateAndTransform(df: DataFrame): DataFrame = {...}

which gets the DataFrame and does the validations and transformations, it seems like I need the

    import sqlContext.implicits._

to avoid the error “value $ is not a member of StringContext” that happens on the line

    .withColumn("time", convertToHourly($"time"))
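(For context: the $"..." syntax comes from an implicit conversion on StringContext that sqlContext.implicits._ brings into scope. A minimal sketch of a workaround, reusing the udf from above with a hypothetical helper name roundToHour: col(...) from org.apache.spark.sql.functions names the same column without any implicits import.)

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

    // col("time") is equivalent to $"time" but needs no implicits import
    def roundToHour(df: DataFrame): DataFrame =
      df.withColumn("time", convertToHourly(col("time")))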

But to use import sqlContext.implicits._ I also need the sqlContext, either defined in the new file, like so:

    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

or passed to the function:

    def ValidateAndTransform(df: DataFrame): DataFrame = {...}
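(A sketch of that second option, with a hypothetical placeholder body: the context travels as an extra parameter and its implicits are imported locally inside the method.)

    import org.apache.spark.sql.{DataFrame, SQLContext}

    def ValidateAndTransform(df: DataFrame, sqlContext: SQLContext): DataFrame = {
      import sqlContext.implicits._  // $"..." now resolves in this scope
      df.filter($"time".isNotNull)   // placeholder for the real validations
    }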

I feel like the separation I'm trying to achieve between the two files (main & validation) isn't done correctly...

Any idea on how to design this? Or should I simply pass the sqlContext to the function?

Thanks!

Etti Gur asked Sep 08 '15




1 Answer

You can work with a singleton instance of the SQLContext. Take a look at this example from the Spark repository:

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {

  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
...
// And wherever you need it, you can do
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._  
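
For example, the validation file could then look something like this (a sketch assuming the ValidateAndTransform signature and the convertToHourly udf from the question; the object name Validation is hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    object Validation {
      def ValidateAndTransform(df: DataFrame): DataFrame = {
        // reuse the singleton instead of constructing a second SQLContext
        val sqlContext = SQLContextSingleton.getInstance(df.rdd.sparkContext)
        import sqlContext.implicits._

        val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

        df.na.drop(3)
          .withColumn("time", convertToHourly($"time"))
      }
    }

This keeps the two files from each constructing their own SQLContext: both resolve the same lazily created instance.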
Marco answered Dec 28 '22