
SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I built manually by reading a CSV file and converting the columns to the right data types.

Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Unlike what the specification states, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.

In my single Scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:

    val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
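
For context, I build the data frame and register the table roughly like this (a simplified sketch, not my exact code; the file path and the handful of columns shown here stand in for the full LINEITEM schema):

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    // Only a few LINEITEM columns are shown; the real table has 16.
    case class LineItem(orderkey: Long, quantity: Double, shipdate: Timestamp)

    val sqlContext = new SQLContext(sc)  // sc is the SparkContext
    import sqlContext.implicits._

    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val lineitem = sc.textFile("lineitem.tbl")  // placeholder path
      .map(_.split('|'))
      .map(f => LineItem(f(0).toLong, f(4).toDouble,
        new Timestamp(fmt.parse(f(10)).getTime)))
      .toDF()

    lineitem.registerTempTable("lineitem")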

When I submit the packaged jar using spark-submit, I get the following error:

Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but `;' found

When I omit the semicolon and do the same thing, I get the following error:

Exception in thread "main" java.util.NoSuchElementException: key not found: date

Spark version is 1.4.0.

Does anyone have an idea what the problem with these queries is?

[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf

asked by TheRealPetron

1 Answer

  1. SQL queries passed to SQLContext.sql shouldn't be terminated with a semicolon - this is the source of your first problem
  2. The DATE UDF expects dates in YYYY-MM-DD form, so DATE('1998-12-01 00:00:00') evaluates to null. Since a timestamp can be cast to DATE, the correct query string looks like this:

    "SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
    
  3. DATE is a Hive UDF, which means you have to use HiveContext, not the standard SQLContext - this is the source of your second problem (a combined sketch follows after this list).

    import org.apache.spark.sql.hive.HiveContext
    
    val sqlContext = new HiveContext(sc) // where sc is a SparkContext
    
  4. In Spark >= 1.5 it is also possible to use the to_date function:

    import org.apache.spark.sql.functions.{lit, to_date}
    import sqlContext.implicits._  // needed for the $"colname" syntax
    
    df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
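
Putting the pieces together, the corrected version might look like this (a sketch; it assumes the data frame is built as before and registered as "lineitem" on the Hive context):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)  // sc is a SparkContext
    // ... build the data frame and register it on this context, e.g.
    // df.registerTempTable("lineitem")

    // No trailing semicolon, and DATE gets a yyyy-MM-dd literal
    val res = sqlContext.sql(
      "SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')")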
    
answered by zero323

