 

Spark (JAVA) - dataframe groupBy with multiple aggregations?

I'm trying to write a groupBy in Spark with Java. In SQL this would look like:

SELECT id, count(id) AS count, max(date) AS maxdate
FROM table
GROUP BY id;

But what is the Spark/Java equivalent of this query? Let's say the variable table is a DataFrame, to keep the relation to the SQL query visible. I'm thinking of something like:

table = table.select(table.col("id"), (table.col("id").count()).as("count"), (table.col("date").max()).as("maxdate")).groupby("id")

Which is obviously incorrect, since you can't use aggregate functions like .count or .max on columns, only on DataFrames. So how is this done in Spark with Java?

Thank you!

asked Jul 15 '16 by lte__

People also ask

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a DataFrame df with columns "name" and "age", and you want to group by those two columns. To get the other columns back after a groupBy, join the aggregated result with the original DataFrame; the joined result will then contain all columns, including the count values.
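A minimal Java sketch of that join-back pattern, assuming an existing Dataset<Row> named df (grouping on a single column here so the join stays simple; all names are illustrative):

import static org.apache.spark.sql.functions.count;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// the aggregate result keeps only the grouping key plus the aggregate columns
Dataset<Row> counts = df.groupBy("name").agg(count("name").as("count"));

// join the aggregates back to the original rows to recover the other columns
Dataset<Row> dataJoined = df.join(counts, "name");

The single-argument "using column" join keeps one copy of the key column in the result, so there is no ambiguity when selecting it afterwards.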

How does group by key work in Spark?

In Spark, the groupByKey function is a frequently used transformation that shuffles data. It takes key-value pairs (K, V) as input, groups the values by key, and produces a dataset of (K, Iterable<V>) pairs as output.
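A short illustration with the Java RDD API, assuming an existing JavaSparkContext named jsc:

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

// shuffles all values for each key together: ("a", [1, 2]) and ("b", [3])
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();

Note that for plain aggregations, reduceByKey or aggregateByKey is usually preferred over groupByKey, since they combine values on each partition before the shuffle.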

How do you add two aggregate functions in PySpark?

In PySpark, groupBy() is used to collect identical values into groups on the DataFrame so that aggregate functions can be applied to the grouped data. Passing several aggregation expressions to a single agg() call performs multiple aggregations at a time.
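The same pattern exists in the Java Dataset API: pass several aggregation expressions to one agg() call. A sketch reusing the question's table with its "id" and "date" columns:

import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// one pass over the grouped data computes all three aggregates
table.groupBy("id").agg(
        countDistinct("date").as("distinct_dates"),
        min("date").as("mindate"),
        max("date").as("maxdate")
).show();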


1 Answer

You could do this with org.apache.spark.sql.functions:

import org.apache.spark.sql.functions;

table.groupBy("id").agg(
    functions.count("id").as("count"),   // count(id) AS count
    functions.max("date").as("maxdate")  // max(date) AS maxdate
).show();
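With static imports (import static org.apache.spark.sql.functions.count; and so on), the calls shorten to count("id") and max("date").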
answered Oct 26 '22 by Yuan JI