I'm trying to write a groupBy in Spark with Java. In SQL this would look like:
SELECT id, count(id) AS count, max(date) AS maxdate
FROM table
GROUP BY id;
But what is the Spark/Java equivalent of this query? Let's say the variable table
is a DataFrame, to keep the relation to the SQL query clear. I'm thinking of something like:
table = table.select(table.col("id"), (table.col("id").count()).as("count"), (table.col("date").max()).as("maxdate")).groupby("id")
Which is obviously incorrect, since you can't call aggregate functions like .count()
or .max() on columns, only on DataFrames. So how is this done in Spark with Java?
Thank you!
1 Answer
You could do this with org.apache.spark.sql.functions:
import org.apache.spark.sql.functions;

table.groupBy("id").agg(
        functions.count("id").as("count"),   // count(id) as count
        functions.max("date").as("maxdate")  // max(date) as maxdate
).show();
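Alternatively, since you already have the SQL, you can register the DataFrame as a temporary view and run the query as-is. A minimal sketch, assuming a SparkSession named spark is in scope (the view name my_table is arbitrary):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Expose the DataFrame to Spark SQL under a temporary name
table.createOrReplaceTempView("my_table");

// Run the original aggregation query unchanged
Dataset<Row> result = spark.sql(
        "SELECT id, count(id) AS count, max(date) AS maxdate FROM my_table GROUP BY id");
result.show();

Both approaches produce the same result; groupBy(...).agg(...) is just the programmatic form of the GROUP BY query.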