Reuse the result of a select expression in the GROUP BY clause?

In MySQL, I can have a query like this:

select  
    cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
    , ... 
from
    some_table t 
group by
    timeHour, ...
order by
    timeHour, ...

where timeHour in the GROUP BY is the result of a select expression.

But when I tried a similar query in Spark SQL, I got an error:

Error: org.apache.spark.sql.AnalysisException: 
cannot resolve '`timeHour`' given input columns: ...

My query for Spark SQL was this:

select  
      cast(t.unixTime as timestamp) as timeHour
    , ...
from
    another_table as t
group by
    timeHour, ...
order by
    timeHour, ...

Is this construct possible in Spark SQL?

asked Sep 24 '17 by leeyuiwah
1 Answer

Is this construct possible in Spark SQL?

Yes, it is. You can make it work in Spark SQL in two ways, so that the new column is available to the GROUP BY and ORDER BY clauses. The reason the alias cannot be used directly is that GROUP BY is resolved against the columns of the FROM clause, before the SELECT list's aliases exist; MySQL allows the alias as an extension, but the Spark SQL version used here follows the standard rule.

Approach 1, using a subquery:

SELECT timeHour, sum(...) AS someThing
FROM (
    SELECT
          from_unixtime(starttime / 1000) AS timeHour
        , ...
        , starttime
    FROM
        some_table
) t
WHERE
    starttime >= 1000 * unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000 * unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
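
For concreteness, here is a minimal runnable sketch of the subquery approach. Everything concrete in it is invented for illustration: the local SparkSession, the three sample rows, and the cnt column; the format string passed to from_unixtime truncates each timestamp to the hour, matching the intent of the original MySQL query.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupByAliasDemo")
  .getOrCreate()
import spark.implicits._

// hypothetical data: starttime in epoch milliseconds, cnt is the value to sum
Seq((1505520000000L, 1), (1505520060000L, 2), (1505523600000L, 3))
  .toDF("starttime", "cnt")
  .createOrReplaceTempView("some_table")

spark.sql("""
  SELECT timeHour, sum(cnt) AS someThing
  FROM (
    SELECT from_unixtime(starttime / 1000, 'yyyy-MM-dd HH:00') AS timeHour, cnt
    FROM some_table
  ) t
  GROUP BY timeHour
  ORDER BY timeHour
""").show()   // two hourly buckets; the rendered hours depend on the session time zone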

Approach 2, using a WITH clause (common table expression), the more elegant way:

-- define the intermediate result as a CTE
WITH table_alias AS (
    SELECT
          from_unixtime(starttime / 1000) AS timeHour
        , ...
        , starttime
    FROM
        some_table
)

-- then query the CTE like a table
SELECT timeHour, sum(...) AS someThing
FROM table_alias
WHERE
    starttime >= 1000 * unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000 * unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
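
The CTE variant runs the same way. A quick check, reusing the hypothetical some_table view from the sketch above (the CTE name hourly is arbitrary):

spark.sql("""
  WITH hourly AS (
    SELECT from_unixtime(starttime / 1000, 'yyyy-MM-dd HH:00') AS timeHour, cnt
    FROM some_table
  )
  SELECT timeHour, sum(cnt) AS someThing
  FROM hourly
  GROUP BY timeHour
  ORDER BY timeHour
""").show()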

Alternative: using the Spark DataFrame API (without SQL), in Scala:

// needs a SparkSession named `spark` in scope (as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._   // for the $"column" syntax

val df = ....   // load the actual table as a DataFrame

// add timeHour as a real column first, then group and sort by it
df.withColumn("timeHour", from_unixtime($"starttime" / 1000))
  .groupBy($"timeHour")
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()

// another way, as per eliasah's comment: alias the expression directly in groupBy
df.groupBy(from_unixtime($"starttime" / 1000).as("timeHour"))
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
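
All three variants work for the same underlying reason: the alias becomes a real column before the grouping step sees it, either through an inner query/CTE or through withColumn. The second DataFrame variant is slightly leaner; aliasing the expression inside groupBy creates timeHour only in the aggregated result, which orderBy can then resolve by name without adding an extra column to the input DataFrame.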
answered Sep 21 '22 by mrsrinivas