
Trouble building a simple SparkSQL application

This is a pretty noob question.

I'm trying to learn about SparkSQL. I've been following the example described here: http://spark.apache.org/docs/1.0.0/sql-programming-guide.html

Everything works fine in the spark-shell, but when I try to use sbt to build a batch version, I get the following error message: object sql is not a member of package org.apache.spark

Unfortunately, I'm rather new to sbt, so I don't know how to correct this problem. I suspect that I need to include additional dependencies, but I can't figure out how.

Here is the code I'm trying to compile:

/* TestApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

case class Record(k: Int, v: String)

object TestApp {
 def main(args: Array[String]) {
   val conf = new SparkConf().setAppName("Simple Application")
   val sc = new SparkContext(conf)
   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
   import sqlContext._
   val data = sc.parallelize(1 to 100000)
   val records = data.map(i => new Record(i, "value = "+i))
   val table = createSchemaRDD(records, Record)
   println(">>> " + table.count)
 }
}

The error is flagged on the line where I try to create a SQLContext.

Here is the content of the sbt file:

name := "Test Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

Thanks for the help.

Asked by Bill_L on Jul 14 '14


1 Answer

As is often the case, the act of asking the question helped me figure out the answer. The fix is to add the following line to the sbt file:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
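For reference, with that line added, the complete sbt file from the question would look like this (versions as used above; adjust to match your Spark release):

```scala
name := "Test Project"

version := "1.0"

scalaVersion := "2.10.4"

// Core Spark plus the Spark SQL module, which provides org.apache.spark.sql
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```

Note that the `%%` operator appends the Scala binary version (here 2.10) to the artifact name, so the spark-sql dependency must use the same Scala version as spark-core.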

I also realized there is an additional problem in the little program above: the call to createSchemaRDD has too many arguments. Since the schema is inferred from the Record case class by reflection, the method needs only the RDD itself. That line should read as follows:

val table = createSchemaRDD(records)

Answered by Bill_L on Jan 13 '23