I'm new to Spark and Spark SQL, and I was trying to run the example from the Spark SQL website: a simple SQL query after loading the schema and data from a directory of JSON files, like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // the import has to come after sqlContext is defined

val path = "/home/shaza90/Desktop/tweets_1428981780000"
val tweet = sqlContext.jsonFile(path).cache()
tweet.registerTempTable("tweet")
tweet.printSchema() // This one works fine
sqlContext.sql("SELECT tweet.text FROM tweet").collect().foreach(println)
The exception that I'm getting is this one:
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
Update
I'm able to execute select * from tweet, but whenever I use a column name instead of * I get the error.
Any advice?
This is SPARK-5009 and has been fixed in Apache Spark 1.3.0.
The issue was that to recognize keywords (like SELECT) in any case, all possible uppercase/lowercase combinations (like seLeCT) were generated in a recursive function. This recursion would lead to the StackOverflowError you're seeing if the keyword was long enough and the stack size small enough. (This suggests that if upgrading to Apache Spark 1.3.0 or later is not an option, you can use -Xss to increase the JVM stack size as a workaround.)
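For illustration, here is a minimal sketch of that kind of recursive case expansion; this is an assumption about the general shape of the code, not the actual implementation in Spark's SqlParser. Each character doubles the number of variants, so SELECT alone expands to 2^6 = 64 spellings, all of which end up wired into the parser:

// Minimal sketch of recursive case expansion (assumed shape, not the
// real code from Spark's SqlParser).
def allCaseVersions(s: String): Seq[String] =
  if (s.isEmpty) Seq("") // one variant of the empty string
  else allCaseVersions(s.tail).flatMap { rest =>
    // each character contributes a lowercase and an uppercase branch
    Seq(s"${s.head.toLower}$rest", s"${s.head.toUpper}$rest")
  }

allCaseVersions("select").size // 64

If you do fall back on the -Xss workaround, note that the stack size has to be set when the driver JVM is launched, e.g. with something like spark-shell --driver-java-options "-Xss4m" (the 4m value is only illustrative; tune it for your environment).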