When I use DataFrame groupBy like this:
df.groupBy(df("age")).agg(Map("id"->"count"))
I will only get a DataFrame with the columns "age" and "count(id)", but in df there are many other columns, like "name".
In short, I want to get the same result as this MySQL query:
"select name,age,count(id) from df group by age"
What should I do when using groupBy in Spark?
You can get all the columns of a Spark DataFrame by using df.columns; it returns the column names as an Array[String].
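For example, a minimal sketch (assuming an active SparkSession named spark; the column names are the ones from the question):

import spark.implicits._

val df = Seq(("Alice", 25, 1)).toDF("name", "age", "id")
val cols: Array[String] = df.columns  // Array("name", "age", "id")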
The groupBy method is defined in the Dataset class. groupBy returns a RelationalGroupedDataset object, which is where the agg() method is defined. Spark makes great use of object-oriented programming! The RelationalGroupedDataset class also defines a sum() method that can be used to get the same result with less code.
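For instance, a quick sketch of the two equivalent spellings (assuming an active SparkSession named spark and a numeric column, here "id"):

import spark.implicits._

val df = Seq(("Alice", 25, 1), ("Bob", 25, 2)).toDF("name", "age", "id")

// Both calls go through RelationalGroupedDataset; sum() is the shorter
// form when a single sum aggregate is all you need.
df.groupBy($"age").agg(Map("id" -> "sum")).show()
df.groupBy($"age").sum("id").show()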
Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
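For example, a minimal sketch of the join approach (the data here is illustrative; assume an active SparkSession named spark):

import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(("Alice", 25, 1), ("Bob", 25, 2), ("Carol", 30, 3))
  .toDF("name", "age", "id")

// Aggregate first, then join back on the grouping key to recover
// the non-grouped columns such as "name".
val counts = df.groupBy($"age").agg(count($"id").as("count_id"))
df.join(counts, Seq("age")).show()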
Since, for aggregations like count, the result is not well defined and behavior tends to vary across systems that support this type of query, you can just include the additional columns using an arbitrary aggregate like first or last.
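A sketch of that approach, reusing the df from above (which name you get per group is not deterministic, mirroring the caveat):

import org.apache.spark.sql.functions.{count, first}

// first() picks an arbitrary "name" per group alongside the count.
df.groupBy($"age")
  .agg(first($"name").as("name"), count($"id").as("count_id"))
  .show()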
In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
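A sketch with a window function, again reusing df (the where at the end is only needed if you want to filter on the aggregate):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Compute the count per age partition without collapsing rows,
// so every original column survives.
val w = Window.partitionBy($"age")
df.select($"name", $"age", count($"id").over(w).as("count_id"))
  .where($"count_id" > 1)  // optional filter on the aggregate
  .show()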