Group by and order by in Spark SQL

I am trying to access S3 data from a Spark application, using Spark SQL to retrieve the data. It does not accept my group by clause.

DataFrame summaryQuery = sql.sql("Select score from summary order by updationDate desc");
summaryQuery.groupBy("sessionId").count().show();
summaryQuery.show();

I also tried it directly in SQL:

DataFrame summaryQuery = sql.sql("Select score from summary group by sessionId order by updationDate desc");
summaryQuery.show();

In both cases I get a SQL exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'score' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How can I query this data?

asked Oct 20 '16 12:10 by hitttt



1 Answer

In Spark SQL, any selected column that is not present in the group by clause has to be wrapped in an aggregate function, such as first(column_name) or last(column_name). These return the first or the last value, respectively, from the rows fetched for each group.

DataFrame summaryQuery = sql.sql("Select first(score) from summary group by sessionId order by updationDate desc");
summaryQuery.show();
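To see why Spark insists on an aggregate here, it may help to look at what group by does to the rows: each sessionId can carry several score values, so the engine needs a rule for picking one. The sketch below is plain Java with no Spark dependency. The SummaryRow class, the latestPerSession helper and the sample rows are hypothetical; only the column names come from the question. It keeps, for each sessionId, the score of the row with the latest updationDate and then sorts the groups newest first. In Spark SQL the ordering column would presumably need the same treatment as score, for example by ordering on max(updationDate) rather than on the bare column.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupBySketch {

    // Hypothetical row type; field names mirror the columns in the question.
    static final class SummaryRow {
        final String sessionId;
        final int score;
        final String updationDate; // ISO yyyy-MM-dd, so string order equals date order

        SummaryRow(String sessionId, int score, String updationDate) {
            this.sessionId = sessionId;
            this.score = score;
            this.updationDate = updationDate;
        }
    }

    // Group rows by sessionId, keep the most recently updated row of each group,
    // then order the surviving rows by updationDate, newest first.
    static List<SummaryRow> latestPerSession(List<SummaryRow> rows) {
        Map<String, SummaryRow> latest = rows.stream().collect(Collectors.toMap(
                r -> r.sessionId,
                r -> r,
                // Merge rule for two rows of the same group: keep the newer one.
                (a, b) -> a.updationDate.compareTo(b.updationDate) >= 0 ? a : b));

        return latest.values().stream()
                .sorted(Comparator.comparing((SummaryRow r) -> r.updationDate).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data: session s1 has two scores, so a choice is required.
        List<SummaryRow> rows = Arrays.asList(
                new SummaryRow("s1", 10, "2016-10-01"),
                new SummaryRow("s1", 30, "2016-10-05"),
                new SummaryRow("s2", 20, "2016-10-03"));

        for (SummaryRow r : latestPerSession(rows)) {
            System.out.println(r.sessionId + " " + r.score + " " + r.updationDate);
        }
        // prints:
        // s1 30 2016-10-05
        // s2 20 2016-10-03
    }
}
```

Unlike first(score), which simply takes whichever row of the group comes first, the merge rule here states explicitly which row wins, so the result does not depend on row order.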
answered Oct 25 '22 13:10 by hitttt