Spark - Group by HAVING with dataframe syntax?

What's the syntax for using a groupby-having in Spark without an sql/hiveContext? I know I can do

DataFrame df = some_df;
df.registerTempTable("df");
DataFrame df1 = sqlContext.sql("SELECT * FROM df GROUP BY col1 HAVING some stuff");

but how do I do it with a syntax like

df.select(df.col("*")).groupBy(df.col("col1")).having("some stuff") 

This .having() does not seem to exist.

asked Aug 09 '16 by lte__

People also ask

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a df that includes the columns "name" and "age", and you want to group by these two columns. To get the other columns back after a groupBy, you can join the aggregated result with the original DataFrame. The joined result will then have all columns, including the count values.

How does groupBy work in Spark?

When we perform groupBy() on a Spark DataFrame, it returns a RelationalGroupedDataset object (GroupedData in PySpark), which provides aggregate functions such as: count() - returns the number of rows for each group; mean() - returns the mean of values for each group; max() - returns the maximum value for each group.

How do you count and group by in PySpark?

PySpark Groupby Count is used to get the number of records for each group. So to perform the count, first, you need to perform the groupBy() on DataFrame which groups the records based on single or multiple column values, and then do the count() to get the number of records for each group.

What is take() in Spark?

take(num) returns the first num elements of the RDD as a list. It works by first scanning one partition and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.


1 Answer

Indeed, .having() doesn't exist in the DataFrame API. You express the same logic with agg followed by where:

df.groupBy(someExpr).agg(someAgg).where(somePredicate)
answered Sep 28 '22 by zero323