When I use DataFrame groupBy like this:
df.groupBy(df("age")).agg(Map("id"->"count"))
I will only get a DataFrame with the columns "age" and "count(id)", but in df there are many other columns, like "name".
In short, I want to get the same result as this MySQL query:
"select name,age,count(id) from df group by age"
What should I do when using groupBy in Spark?
You can get all the columns of a Spark DataFrame by using df.columns; it returns the column names as an Array[String].
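For example, a minimal sketch (assuming an active SparkSession named spark; the column names are the ones from the question):

import spark.implicits._

val df = Seq(("Alice", 25, 1)).toDF("name", "age", "id")
val cols: Array[String] = df.columns  // Array("name", "age", "id")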
The groupBy method is defined in the Dataset class. groupBy returns a RelationalGroupedDataset object, which is where the agg() method is defined. Spark makes great use of object-oriented programming! The RelationalGroupedDataset class also defines a sum() method that can be used to get the same result with less code.
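For instance, a quick sketch of the two equivalent spellings (assuming an active SparkSession named spark and a numeric column, here "id"):

import spark.implicits._

val df = Seq(("Alice", 25, 1), ("Bob", 25, 2)).toDF("name", "age", "id")

// Both calls go through RelationalGroupedDataset; sum() is the shorter
// form when a single sum aggregate is all you need.
df.groupBy($"age").agg(Map("id" -> "sum")).show()
df.groupBy($"age").sum("id").show()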
Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
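For example, a minimal sketch of the join approach (the data here is illustrative; assume an active SparkSession named spark):

import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(("Alice", 25, 1), ("Bob", 25, 2), ("Carol", 30, 3))
  .toDF("name", "age", "id")

// Aggregate first, then join back on the grouping key to recover
// the non-grouped columns such as "name".
val counts = df.groupBy($"age").agg(count($"id").as("count_id"))
df.join(counts, Seq("age")).show()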
Since, for aggregations like count, the result is not well defined and behavior tends to vary across systems that support this type of query, you can just include the additional columns using an arbitrary aggregate like first or last.
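A sketch of that approach, reusing the df from above (which name you get per group is not deterministic, mirroring the caveat):

import org.apache.spark.sql.functions.{count, first}

// first() picks an arbitrary "name" per group alongside the count.
df.groupBy($"age")
  .agg(first($"name").as("name"), count($"id").as("count_id"))
  .show()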
In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
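A sketch with a window function, again reusing df (the where at the end is only needed if you want to filter on the aggregate):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Compute the count per age partition without collapsing rows,
// so every original column survives.
val w = Window.partitionBy($"age")
df.select($"name", $"age", count($"id").over(w).as("count_id"))
  .where($"count_id" > 1)  // optional filter on the aggregate
  .show()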