Spark SQL Stackoverflow

I'm new to Spark and Spark SQL, and I was trying to run the example from the Spark SQL website: just a simple SQL query after loading the schema and data from a directory of JSON files, like this:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val path = "/home/shaza90/Desktop/tweets_1428981780000"
val tweet = sqlContext.jsonFile(path).cache()

tweet.registerTempTable("tweet")
tweet.printSchema() // This one works fine

val texts = sqlContext.sql("SELECT tweet.text FROM tweet").collect().foreach(println)

The exception that I'm getting is this one:

java.lang.StackOverflowError
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
    at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

Update

I'm able to execute select * from tweet but whenever I use a column name instead of * I get the error.

Any advice?

asked May 02 '15 by Lisa


1 Answer

This is SPARK-5009 and has been fixed in Apache Spark 1.3.0.

The issue was that, in order to recognize keywords (like SELECT) regardless of case, every possible uppercase/lowercase combination (like seLeCT) was generated by a recursive function. This recursion leads to the StackOverflowError you're seeing when the keyword is long enough and the stack size small enough. (It also suggests a workaround if upgrading to Apache Spark 1.3.0 or later is not an option: increase the JVM stack size with -Xss.)
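
For context, here is a minimal sketch (an illustration, not Spark's actual source) of the kind of recursive keyword expansion described above, and why it interacts badly with the parser combinators that appear in the stack trace:

// Sketch only: recursively generate every upper/lowercase variant of a keyword.
// A keyword of length n yields 2^n variants, and chaining that many parser
// alternatives is what produces the deeply nested Parsers.append calls in the trace.
def allCaseVersions(s: String, prefix: String = ""): Stream[String] =
  if (s.isEmpty) Stream(prefix)
  else allCaseVersions(s.tail, prefix + s.head.toLower) ++
       allCaseVersions(s.tail, prefix + s.head.toUpper)

allCaseVersions("select").foreach(println) // select, selecT, ..., SELECT (64 variants)

If you do need the stack-size workaround, it can be applied when launching the shell, for example spark-shell --driver-java-options "-Xss4m" (the 4m value is only an illustrative choice, not something prescribed here).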

answered Sep 24 '22 by Daniel Darabos