This is a pretty noob question.
I'm trying to learn about SparkSQL. I've been following the example described here: http://spark.apache.org/docs/1.0.0/sql-programming-guide.html
Everything works fine in the Spark-shell, but when I try to use sbt to build a batch version, I get the following error message:
object sql is not a member of package org.apache.spark
Unfortunately, I'm rather new to sbt, so I don't know how to correct this problem. I suspect that I need to include additional dependencies, but I can't figure out how.
Here is the code I'm trying to compile:
/* TestApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

case class Record(k: Int, v: String)

object TestApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    val data = sc.parallelize(1 to 100000)
    val records = data.map(i => new Record(i, "value = " + i))
    val table = createSchemaRDD(records, Record)
    println(">>> " + table.count)
  }
}
The error is flagged on the line where I try to create a SQLContext.
Here is the content of the sbt file:
name := "Test Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Thanks for the help.
As is often the case, the act of asking the question helped me figure out the answer: add the following line to the sbt file.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
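For reference, here is a sketch of what the complete sbt file might look like with that line added. The name, versions, and resolver are copied from the question; combining the two dependencies into a single Seq is just my own stylistic choice, and two separate += lines should work equally well.

```scala
// build.sbt (sketch; all values are taken from the question above)
name := "Test Project"

version := "1.0"

scalaVersion := "2.10.4"

// spark-core alone does not contain the org.apache.spark.sql package;
// Spark SQL is published as a separate artifact, so both are listed here.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.spark" %% "spark-sql"  % "1.0.0"
)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```

The %% operator appends the Scala binary version to the artifact name, so with scalaVersion 2.10.4 these should resolve to spark-core_2.10 and spark-sql_2.10. Keeping both Spark artifacts on the same version avoids mixing incompatible builds.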
I also realized there is an additional problem in the little program above. There are too many arguments in the call to createSchemaRDD. That line should read as follows:
val table = createSchemaRDD(records)